Section 1
INTRODUCTION 

Future  high  performance  parallel  computing  systems  must  rely  on  the  development  of  a  high 
throughput  three-dimensional  interconnection  system.  To  maximize  the  throughput  while 
minimizing  crosstalk  and  power  requirements,  the  Electro-Optic  Computing  Architecture  (EOCA) 
program  seeks  to  add  global  inter-wafer  optical  interconnection  capability  to  locally  connected 
parallel  processors  (e.g.,  the  Hughes  3-D  computer  and  Multiple-Chip-Module  processors).  This 
would  enable  us  to  (a)  free  the  processor  from  its  present  I/O  limitations,  allowing  efficient  parallel 
communication  with  optical  memories  and  sensors;  (b)  allow  an  efficient  coupling  of  optical 
co-processors  to  handle  fine  grain  image  processing  and  global  2-D  operations  at  throughput  rates 
exceeding  terabits/sec;  (c)  allow  efficient  sorting  operations  to  be  carried  out  through  the  use  of 
optical  shuffling,  with  expected  enhancements  of  about  100:1. 

The  objective  of  the  EOCA  program  is  to  develop  multi-function  electro-optic  interfaces  and 
optical  interconnect  units  to  enhance  the  performance  of  the  locally  connected  parallel-processor 
system and form the building blocks for future electro-optic computing architecture. Specifically,
three multi-function interface modules will be developed: an Electro-Optical Interface (EOI), an
Optical Interconnection Unit (OIU), and a Space-Time Compander (STC). A conceptual schematic of
the  EOCA  system  is  depicted  in  Figure  1-1.  Under  the  first  year  development  tasks  we  designed  all 
three  interface  modules  and  analyzed  the  component  design.  In  parallel,  we  designed  an  EOCA 
based  on  parallel  processing  and  interconnection  architecture.  A  design  analysis  of  the  system  and 
performance  comparison  were  also  carried  out. 

A  brief  overview  of  the  design/design  analysis  effort  is  given  next.  A  detailed  description  of 
this  effort  is  given  in  Sections  2  to  5  below. 

1.1  ELECTRO-OPTICAL  INTERFACE 

To  couple  locally  connected  processor  arrays  with  optical  interconnection,  it  is  critical  to 
integrate  opto-electronic  interface  devices  directly  on  silicon  substrates.  This  can  be  realized  by 
integrating  light  detectors  and  light  transmitters,  such  as  PLZT  or  III-V  multiple  quantum  well 
(MQW)  light  modulators,  on  silicon  wafers.  The  EOI  allows  bi-directional  communication  between 
a  parallel  electronic  processor  and  a  parallel  optical  interconnect  unit  over  coarse-grain  arrays.  A 
4x4  array  of  PLZT  modulators  with  a  silicon  driver  circuit  has  been  designed  for  an  EOI 
demonstration.  The  driver  can  deliver  up  to  35  V  voltage  swing,  from  a  standard  5  V  power 
supply,  to  activate  PLZT  modulators.  In  addition,  two  generations  of  electronic  bypass-and- 
exchange  switch  arrays  have  been  designed. 
Figure 1-1. Basic building block of electro-optic computer architectures.

1.2  OPTICAL  INTERCONNECTION  UNIT 

One  of  the  key  components  of  the  proposed  EOCA  system  is  the  OIU  that  provides  the  global 
interconnections. The optical transpose interconnection system (OTIS) is a simple, efficient, and
scalable means of providing a transpose interconnection utilizing only a pair of lenslet arrays. A
detailed  analysis  of  the  OTIS  optical  system  has  been  conducted  as  part  of  design  analysis  work. 
The  geometrical  relations  for  the  optical  system  have  been  derived,  including  lens  positions, 
system  length,  and  system  volume.  The  analysis  of  the  OTIS  system  raises  some  important  issues. 
First,  the  insertion  losses  are  relatively  high  and  must  be  reduced.  Second,  the  use  of  a 
conventional  Polarizing  Beam  Splitter  (PBS)  will  create  non-uniformity  and  crosstalk.  Appropriate 
approaches,  such  as  off-normal  illumination  and  polarizing  volume  holographic  beamsplitter,  have 
been  proposed  to  address  those  issues. 

1.3  SPACE-TIME  COMPANDER 

The  key  function  of  the  STC  is  to  match  fine-grain  images  with  coarse-grain  processor  arrays. 
The  function  of  size  matching  is  performed  by  grouping  every  set  of  8x8  pixels  in  fine-grain 
images  into  a  supercell.  Each  supercell  is  then  registered  with  the  corresponding  processor  in  the 
coarse-grain  processor  arrays.  By  either  compacting  8x8  pixels  into  a  supercell  or  expanding  a 
supercell  into  8x8  pixels,  the  STC  provides  a  bi-directional  communication  between  a  fine-grain 
image  and  a  coarse-grain  processor  array. 

For the STC, we began with the design analysis of the compander. A combined
modulator/detector approach was selected to allow efficient use of the compander surface area as
an interface to a future ultra-high-resolution device. We then designed a 128x128 pixel CCD-based
STC. The heart of the combined modulator/detector STC is an 8x8 charge-coupled device (CCD)
superpixel that can convert a parallel optical input into a serial electrical signal and vice versa. We
completed two baseline designs of the superpixel array (2-phase and 3-phase serial CCD) for the
STC. The issues of device packaging and signal link have also been examined. We also laid out
an optical design to link a fine-grain 2-D image to superpixels on the coarse-grain processor arrays.

1.4  SYSTEM  ANALYSIS 

It has been shown that, architecturally, it is critical to add global interconnection capability to the
locally connected parallel processor arrays to improve their processing speed, particularly for image
processing applications. The goal of this study is to identify and analyze all the architectural
implications  due  to  the  addition  of  free-space  optical  interconnects  in  the  locally  connected 
processor  array.  To  that  effect,  we  first  developed  models  for  both  electronic  and  free-space  optical 
interconnect  technologies.  These  interconnection  models  allow  us  to  assess  the  unique  advantages 
of  our  optically  augmented  3-D  computer  approach.  We  also  developed  energy  requirements  and 
time-delay  models  for  RC  limited  lines  in  the  3-D  computer,  for  Si-CMOS  receivers,  and  for  three 
types  of  light  transmitters  (VCSELs,  MQW  modulators,  PLZT  modulators).  In  addition,  we 
developed  a  model  for  terminated  lines  (transmission  lines)  as  well  as  our  global  optoelectronic 
interconnect  system.  Results  show  significant  advantages  for  free-space  optical  interconnects  vs. 
RC  limited  lines  in  the  3-D  computer  environment.  We  also  optimized  the  architecture  of  the  one 
stage-shuffle  interconnection  network  for  permutation  traffic  in  the  3-D  computer  and  developed 
the concept of the time-dilated network. The performance of the EOCA system was also studied for
(1)  routing  and  sorting  and  (2)  FFT  applications. 
Section 2

ELECTRO-OPTICAL  INTERFACE 

The  EOI  provides  bi-directional  communication  between  a  parallel  electronic  processor  and  a 
parallel  optical  interconnect  unit  over  coarse-grain  arrays.  In  order  to  couple  locally  connected 
processor  arrays  with  optical  interconnection,  it  is  critical  to  realize  opto-electronic  interface  devices 
by  integrating  light  detectors  and  light  transmitters  with  silicon  wafers.  Flip-chip  bonding  is  a 
mature  and  well-developed  technique  used  extensively  for  silicon  packaging.  This  technology  is 
implemented in this program for the Si/PLZT optoelectronic technology. A high-voltage amplifier
circuit has been designed to drive the flip-chip bonded Si/PLZT smart modulator. We have
designed a 4x4 array of Si/PLZT smart modulators for an EOI demonstration (see Section 2.1). In
addition, electronic bypass-and-exchange switches play an equally important role in implementing an
efficient optoelectronic interconnection system. Two generations of electronic bypass-and-exchange
switch arrays have been designed, as has a 16x16 (16 inputs, 16 outputs) switch
(see Section 2.2).

2.1  FLIP-CHIP  BONDED  Si/PLZT  SMART  PIXEL 

In  the  hybrid  Si/PLZT  optoelectronic  technology  a  PLZT  wafer  is  used  both  as  a  support 
substrate  and  for  light  modulation.  The  electronic  driver  circuitry  is  built  on  the  silicon  chips  and 
connected  to  the  PLZT  modulator  through  metal  bumps  (as  schematically  shown  in  Figure  2-1). 
The silicon chips can be tested separately before placement on PLZT to ensure a high-yield process.
The  flip-chip  process  then  mechanically  aligns  the  silicon  wafers  to  the  corresponding  PLZT 
modulators.  To  achieve  a  large  dynamic  range  PLZT  generally  requires  20-40  V  to  modulate  an 
external  light.  High  voltage  bipolar  and  MOS  processes  are  presently  incapable  of  supporting  VLSI 
circuit  densities.  On  the  other  hand,  transistor  breakdown  voltages  in  VLSI  chips  are  too  low  to 
provide  high  voltage  outputs  directly.  We  have  designed  a  special  circuit  capable  of  delivering  a 
driving  voltage  swing  up  to  35  V  from  a  standard  5  V  power  supply.  High  breakdown  voltage  of 
the  circuit  is  accomplished  using  series  connected  transistors  and  a  current-mirror  like  structure. 
Figure 2-1. Cross section of a flip-chip bonded Si/PLZT smart pixel.


We  have  designed  and  integrated  a  4x4  array  of  reflective  PLZT  modulators  with  the  silicon 
driver  circuit.  The  combination  of  the  driver  and  the  reflective  PLZT  modulator  produces  light 
modulation with a dynamic range of up to 600:1. Studies of the speed response of PLZT 9.5/65/35
showed the rise and fall times to be less than 10 ns each, fundamentally limited by the driver
circuitry. The Si/PLZT smart pixel is capable of building an optical link with a bit-error rate (BER)
better than 10^-14 under the following experimental conditions:

•  5 Mbits/sec data rate
•  4x4 array of 40x40 µm modulators
•  Modulator bias at 60 V
•  Modulator voltage swing at 25 V
•  300 µW optical input per modulator with 10 µm spot

The output power swing was measured at 100 to 200 µW per modulator over the array.

Alternatively,  MQW  modulators  can  also  be  flip-chip  bonded  to  3-D  silicon.  In  this  case  the 
driver  circuit  can  be  located  either  on  the  3-D  stack  or  be  monolithically  integrated  with  the 
modulator  structure  on  the  GaAs.  In  the  case  of  a  one-to-one  optical  interconnection  system,  MQW 
structures  offer  advantages  over  PLZT  modulators.  First,  the  driver  requirements  are  dramatically 
reduced since MQW modulators can operate simply with logic-level voltage swings. Second, the
same MQW structure can be used either as a modulator or as a detector, and both input and output
optical devices will lie on the same plane. This would eliminate the critical requirements for
aligning the processor array with the optoelectronics. Further modifications might include using the
same MQW structure as a bi-directional device (detector and modulator) in conjunction with a
bi-directional driver/amplifier circuit.

2.2  BYPASS-AND-EXCHANGE  SWITCH  ARRAY 

To  achieve  efficient  data  communication  in  the  optoelectronic  network,  a  key  component  is  the 
electronic  switch  that  performs  the  local  routing  of  the  data.  A  switch  is  implemented  by  cascading 
2x2 bypass-and-exchange switches, each partitioned into two 2-to-1 multiplexers called half-switches
(as shown in Fig. 2-2). A pair of such half-switches is called partner half-switches. Every
half-switch  gets  an  input  of  its  own,  plus  the  input  of  its  partner  half-switch  as  its  second  input. 
Depending  on  the  control  signal,  which  can  be  fed  externally  (global  control)  or  can  be  computed 
internally  (self-routing),  each  half-switch  transmits  one  of  the  two  possible  inputs  to  its  single 
output.  We  further  extended  the  switch  to  be  bi-directional;  the  data  flow  direction  can  be 
controlled  by  one  single  control  bit.  Thus  both  inputs  and  outputs  can  be  exchanged  in  the  same 
switch.  A  complete  16x16  switch  consists  of  log2N  =  4  stages  of  N  =  16  half-switches  each, 
for  a  total  of  N  log2N  =  64  half-switches.  The  block  diagram  of  a  16  channel  switch  is  shown  in 
Fig.  2-3. 


Figure  2-2.  Basic  building  block  of  electronic  1/2  switches. 

Two different half-switches have been designed for the OTIS. The first one, design A, is a
simple, bi-directional 2-to-1 multiplexer. This was built as a proof of concept for the 2-D layout
and for the operation of a half-switch as the building block of the complete switch. The second design,
design B, is a novel self-routing half-switch that can detect contention and drop and resend data
packets.

2.2.1  Design  A 

The block diagram of design A is shown in Fig. 2-4(a).(2-1) The half-switch uses an external
direction signal, dir, that is also broadcast to every other half-switch in the entire switch. This signal
determines the direction of the data flow. The direction signal is arbitrarily chosen to be dir = 1
for a left-to-right data flow. In this case, x0 and x1 are the two inputs and y0 is the
output, while y1 is floating. Another external control signal, c, is sent to the half-switch, controlling
which input channel it should transmit to its output. Again arbitrarily, c is chosen to be 1
when x1 is to be transmitted, and similarly, c = 0 causes x0 to go through. If the
direction is reversed (i.e. dir = 0), then c = 0 causes y0 to be sent to x0, and c = 1 causes y1 to be
sent to x0. In this design, only four control bits are used for a four-stage switch, labeled c0
through c3 (refer to Fig. 2-3). The same control bit is sent to all the half-switches in the same
stage. As a result, only the final output destination of a single input can be determined, whereas the
remaining 15 inputs go to the other 15 outputs without contention.
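
The truth table of Fig. 2-4(a) can be written down as a short behavioral model. The sketch below is illustrative only: the function name `half_switch_a` and its keyword arguments are hypothetical, and it models logic levels rather than the transistor-level circuit.

```python
def half_switch_a(direction, c, x0=None, x1=None, y0=None, y1=None):
    """Behavioral model of the design A half-switch truth table.

    Returns (port, value): the port that is driven and the bit placed on it.
    The unused ports are treated as floating and are not returned.
    """
    if direction == 1:
        # Left-to-right flow: x0 and x1 are inputs, y0 is the output.
        return ("y0", x1 if c == 1 else x0)
    # Right-to-left flow (dir = 0): y0 and y1 are inputs, x0 is the output.
    return ("x0", y1 if c == 1 else y0)


if __name__ == "__main__":
    print(half_switch_a(1, 0, x0=1, x1=0))  # x0 is passed to y0
    print(half_switch_a(0, 1, y0=0, y1=1))  # y1 is passed to x0
```

The model makes the role of the two external bits explicit: dir selects which side acts as the input side, and c selects which of the two inputs reaches the single driven port.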
Figure 2-4(b) shows the circuit schematics for design A. The numbers next to the transistors
give  width/length  in  units  of  The  half-switch  is  implemented  with  only  24  transistors.  For  a 
given direction signal, half of the transistors are not used. As an example, if dir = 1, transistors
M13 - M24 are not used since M13 and M16 block their final output. In addition, for that given
direction signal, the control signal determines which input will be transmitted. For example, if c =
1, M5 and M8 block x0, and x1 is sent through. Similarly, if c = 0, M1 and M4 block x1, and x0
is transmitted. Effectively, for a given direction and control signal, the input is inverted twice to
reach the output. Due to this simple structure, high-speed operation is achievable. For a 2.0 µm
CMOS technology, simulations showed a maximum speed of 250 Mbits/s.

DIR: direction; C: control
DIR = 1, C = 0: x0 -> y0
DIR = 1, C = 1: x1 -> y0
DIR = 0, C = 0: y0 -> x0
DIR = 0, C = 1: y1 -> x0

Figure 2-4. Half-switch (design A). (a) Block diagram and truth table. (b) Circuit schematics.

For one input-to-output connection within the switch, the cumulative propagation delay per
half-switch is shown in Figure 2-5. The time indicated on top of each half-switch shows how much the
control bits must be delayed as the data progresses through the switch. To speed up the overall system
and to make it scalable, the control signals, c0-c3, are fed into the switch in a pipelined fashion.
In other words, the control bit c1, which belongs to the second stage, is delayed externally by the
same amount of time it takes for a signal to propagate through the first stage of half-switches.
Similarly, the control bit to the third stage, c2, is delayed by twice that amount, and so on. This
method ensures that the control signal and the inputs of a given stage arrive at the same time at the
desired half-switches. The speed of the overall system is then equal to the speed of a
single half-switch. As the switch size increases, the number of stages and the total number of
half-switches increase, but the overall speed stays constant since the propagation delay of a single
half-switch is constant.
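
The scaling argument above can be illustrated numerically. The following sketch is a simplification, not the report's timing model: it assumes every stage has the same propagation delay (Figure 2-5 shows that the actual per-stage delays differ slightly), and the function name `switch_stats` is hypothetical.

```python
def switch_stats(n_channels, half_switch_delay_ns):
    """Return (stages, total_half_switches, control_delays_ns) for an N x N switch.

    Assumes a uniform half-switch delay per stage, which is an idealization.
    """
    if n_channels < 2 or n_channels & (n_channels - 1):
        raise ValueError("n_channels must be a power of two")
    stages = n_channels.bit_length() - 1      # log2(N) stages
    total = n_channels * stages               # N log2(N) half-switches
    # The control bit of stage s is delayed by s stage delays; c0 is not delayed.
    delays = [s * half_switch_delay_ns for s in range(stages)]
    return stages, total, delays


if __name__ == "__main__":
    print(switch_stats(16, 2.5))
```

For N = 16 the function reports 4 stages and 64 half-switches, matching the counts given for the 16x16 switch; only the list of control delays grows with N, while the per-stage delay that sets the bit rate does not.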


[Figure 2-5 timing labels. Top row: t=0, 2.6 ns, 5.5 ns, 8.4 ns. Cumulative delays from t=0: 2.73 ns, 5.60 ns, 8.44 ns, 10.91 ns; and 2.33 ns, 4.77 ns, 7.24 ns, 9.39 ns.]

Figure 2-5. Cumulative delays during signal propagation through the switch.


2.2.2  Design  B 

Design B is built upon design A, but it adds functionality to the switch operation. The block
diagram is shown in Fig. 2-6. It still acts as a 2-to-1 multiplexer. However, it has built-in
self-routing capability; that is, the control bit for each half-switch is computed internally. Every input
packet contains, as a header, the address of its desired output destination (e.g. for N = 256
channels, log2N = 8 bits of address are needed). As data packets are presented, the half-switches
in the first stage process the first bit of each of their inputs and decide on their control signal. The
remaining 23 bits are then transmitted untouched. The same processing is done in the next stages
until the data packets arrive at their output destinations.
[Figure 2-6 legend. DIR: direction. t: transmission (= 1 for data bit, = 0 for address bit). Other labeled signals: t delayed by one cycle; contention detector; c0, which corresponds to "c" of design A; and a flag that equals 1 if c0 is a don't care.]

Figure 2-6. Block diagram of Half-Switch (design B).


As packets are transmitted through the switches, two of them may have to use the same
half-switch to arrive at their outputs, and thus there is contention (a hot spot). In this case, the half-switch
transmits one of the inputs in a deterministic way and drops its other input. At the same time, to
ensure that the dropped data is not lost, a contention signal is generated within the half-switch
where the blocking happened. This contention signal propagates in the direction opposite to the
data flow and retraces the path that the dropped packet had followed up to that
point. Once it reaches the dropped packet's input buffer, it sets the input buffer to resend the same
packet, so that all the information is eventually routed through the network.

In this design, the direction of data flow is again determined by an external direction signal
supplied to all the half-switches. In addition, an external transmission signal, t, is provided
to inform each half-switch that data transmission is occurring. This signal is set
to 1 if the incoming bit is a data bit and to 0 if it is an address bit. Therefore, for all the
half-switches at a given stage, t = 0 during the first cycle of a data packet and t = 1 for the remaining 23
cycles. Just as in design A, the transmission signal is pipelined, that is, delayed by the same
amount of time that the input takes to reach that stage. As a result, a transmission signal of 0 for 1
cycle and 1 for 23 cycles propagates from stage to stage at the same speed as the data,
with the 0 bit arriving at a stage when the control signals are to be computed at that
stage (i.e. the incoming bits are address bits). This way, a single pulse of t = 0 at the input buffer
stage enables all the half-switches in the entire switch to know exactly when to process the
incoming bits as address bits rather than data bits. As each data packet is introduced into the
pipeline, the address bits are processed first. This processing of the header of a data packet takes
exactly 2 log2N cycles, where a cycle is equal to the duration of a single bit. This is the time it takes
for the last address bit (i.e. the (log2N)th bit) to be processed by the last stage (i.e. the (log2N)th
stage). After that, the throughput of that channel is equal to the speed of a single
half-switch, since the pipeline is completely filled at this point.
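
The pipelined transmission signal can be sketched as a function of stage and cycle. This is an illustrative model under stated assumptions: the pulse is taken to advance exactly one bit cycle per stage, the 24-cycle packet length follows from the "0 for 1 cycle, 1 for 23 cycles" pattern in the text, and the function names are hypothetical.

```python
def header_cycles(n_channels):
    """Header processing time in bit cycles, 2*log2(N), as quoted in the text."""
    if n_channels < 2 or n_channels & (n_channels - 1):
        raise ValueError("n_channels must be a power of two")
    return 2 * (n_channels.bit_length() - 1)


def transmission_signal(stage, cycle, packet_bits=24):
    """Value of t seen at a stage (0-based) on a cycle (0-based), or None when idle.

    Assumes the t waveform is delayed by one cycle for each stage traversed.
    """
    offset = cycle - stage            # position inside the packet at this stage
    if offset < 0 or offset >= packet_bits:
        return None                   # packet has not arrived yet or has passed
    return 0 if offset == 0 else 1    # first bit is an address bit, rest are data


if __name__ == "__main__":
    print(header_cycles(256))
    print([transmission_signal(2, c) for c in range(6)])
```

Under these assumptions the single t = 0 pulse reaches stage s on cycle s, which is the moment that stage must compute its control signal.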
Section 3

OPTICAL  INTERCONNECTION  UNIT 


The  OIU  allows  a  parallel  optical  interconnect  to  another  EOI  unit,  thus  enabling 
interconnection  between  two  parallel  electronic  processors.  The  OIU  further  allows  data  shuffling 
to  enable  global  operation  (such  as  sorting)  to  be  performed  in  the  parallel  electronic  processors. 
The  optical  system  includes  the  interconnection  lenslets,  based  on  the  Optical  Transpose 
Interconnection  System  (OTIS),  and  the  elements  required  to  power-up  the  modulators. 

The  Optical  Transpose  Interconnection  System  (OTIS)  is  a  simple,  efficient,  and  scalable 
means of providing a transpose interconnection utilizing only a pair of lenslet arrays. For the OIU
applications, the interconnection lenslets must meet two requirements: high light
efficiency and bi-directionality. The usefulness of the transpose interconnection has previously
been  shown  for  three  architecture  classes,  namely,  shuffle-based  multistage  interconnection 
networks,  mesh  of  trees  matrix  processors,  and  hypercube  interconnections.  Section  3.1  briefly 
reviews  the  transpose  interconnection  functionality  and  Section  3.2  describes  the  application  of 
OTIS  for  shuffle  exchange  networks.  Section  3.3  describes  the  basic  OTIS  optical  system  and 
various  scalability  parameters  are  analyzed  including  system  length,  system  efficiency,  and 
crosstalk.  Variations  of  the  basic  OTIS  optical  system  for  different  applications  are  introduced, 
namely  a  generalized  system  to  match  the  geometry  of  any  arbitrary  optoelectronic  chip  layout,  and 
the  folded,  bi-directional,  and  multi-channel  systems.  Design  optimization  solutions  are  also 
introduced  in  Section  3.3.  A  simplified  version  of  OTIS  has  been  successfully  designed  and  tested 
for  the  OIU  feasibility  demonstration.  In  Section  3.4,  experimental  verification  for  a  simple  64x64 
channels  OTIS  is  presented. 

3.1  TRANSPOSE  INTERCONNECTION 

A  transpose  operation  usually  consists  of  symmetrically  interchanging  the  elements  of  a  2-D 
array  with  respect  to  the  first  diagonal  of  that  array.  In  this  project,  we  propose  to  implement  a 
particular  transpose  interconnection  between  two  arbitrary  planes  where  each  row  of  the  input  and 
output  arrays  is  arranged  in  a  raster  format  within  that  array  as  shown  in  Figure  3-1.  The 
transpose interconnection we describe is a one-to-one interconnection between L transmitters {lt}
and L receivers {lr}, where lt and lr range from 0 to L-1 and L is the product of two integers, M
and N. Since L = MN, the indices lt and lr can be divided into ordered pairs (nt, mt) and (mr,
nr) respectively, where mt and mr range from 0 to M-1 and nt and nr range from 0 to N-1, such
that:

$$
\begin{aligned}
l_t &= M n_t + m_t, & l_r &= N m_r + n_r \\
n_t &= \operatorname{trunc}(l_t / M), & m_r &= \operatorname{trunc}(l_r / N) \\
m_t &= l_t \bmod M, & n_r &= l_r \bmod N
\end{aligned}
\tag{3-1}
$$

The indices nt and mr are referred to as the major indices of the transmitters and receivers,
respectively; mt and nr are called the minor indices; and lt and lr are referred to as the scalar
indices. In the transpose interconnection, (nt, mt) is connected to (mr, nr) if and only if mt = mr
and nt = nr. Such an interconnection is called an MxN transpose.
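
The index arithmetic of Equation (3-1) translates directly into code. The sketch below is a minimal illustration; the function name `transpose_receiver` is hypothetical, and it ignores the 180-degree rotation of the receiver plane that the physical optics introduces.

```python
def transpose_receiver(l_t, M, N):
    """Scalar receiver index l_r connected to transmitter l_t in an M x N transpose."""
    if not 0 <= l_t < M * N:
        raise ValueError("l_t out of range")
    n_t, m_t = divmod(l_t, M)   # major index n_t, minor index m_t
    m_r, n_r = m_t, n_t         # transpose rule: m_r = m_t and n_r = n_t
    return N * m_r + n_r        # l_r = N * m_r + n_r


if __name__ == "__main__":
    M, N = 16, 4
    mapping = [transpose_receiver(l, M, N) for l in range(M * N)]
    print(mapping[:8])
```

Because each (nt, mt) pair maps to a distinct (mr, nr) pair, the resulting list is a permutation of 0 to L-1, which is what makes the interconnection one-to-one.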
Figure 3-1. Input and output plane of a 16x4 OTIS, showing major and minor index patterns as well as
the transpose mapping of a typical signal pattern.


Figure  3-1  shows  the  physical  layout  of  the  transmitter  and  receiver  planes  of  OTIS  for 
M  =  16  and  N  =  4.  Both  planes  are  divided  into  regions  in  which  the  major  index  is  constant. 
The  major  pattern  of  each  plane  indicates  the  location  of  each  of  these  regions.  The  minor  pattern 
indicates  the  location  of  the  transmitters  or  receivers  within  a  region.  The  major  pattern  of  the 
transmitter plane is identical to the minor pattern of the receiver plane, except for a scaling factor
√M. Likewise, the minor pattern of the transmitter plane is similar to the major pattern of the
receiver plane, but is smaller by a factor √N. The particular order of indexing within a pattern will
depend  on  the  application  architecture.  However,  the  order  of  indexing  within  the  major  (minor) 
pattern  of  the  transmitter  plane  must  match  the  order  of  indexing  within  the  minor  (major)  pattern 
of  the  receiver  plane.  The  gray  transmitters  and  receivers  on  Figure  3-1  show  a  pattern  of  signals 
mapped  from  the  transmitter  plane  to  the  receiver  plane  via  the  transpose  interconnection. 

Figure 3-2 shows the side view of the OTIS optical system that connects the two planes
described above; the top view would be similar. The major and minor indices are further divided
into x and y components, so that the transmitters are indexed by (mty, mtx, nty, ntx) and the
receivers by (nry, nrx, mry, mrx). A √N x √N array of lenslets is placed in front of the transmitter
plane and a √M x √M array of lenslets is located before the receiver plane.

Figure 3-2. Side view of the OTIS showing imaging through coupled lenslet arrays.

The  interconnection  from  transmitter  (mt,  nt)  to  receiver  (nr,  mr)  is  traced  by  the  chief  ray 
passing  through  the  center  of  the  transmitter  lens  m  and  the  center  of  the  receiver  lens  n.  These  two 
lenses  comprise  an  imaging  system  between  the  transmitter  and  receiver  planes.  Note  that  the 
optical  imaging  system  requires  that  the  indexing  in  the  receiver  plane  be  rotated  180  degrees 
relative  to  the  transmitter  plane. 

3.2  RELATION  BETWEEN  OTIS  AND  THE  SHUFFLE  EXCHANGE 
NETWORK 

An  important  application  of  OTIS  is  in  support  of  multistage  interconnection  network  (MIN) 
architectures  based  on  k-shuffles.  A  k-shuffle  MIN  functionally  provides  full  connectivity  between 
L input channels and L output channels in log_k L stages of optoelectronic switch planes and
(log_k L) - 1 stages of optical k-shuffles. One optoelectronic switch has k optical inputs (receivers)
and  k  optical  outputs  (transmitters)  and  provides  crossbar  equivalent  electronic  switching  between 
its  k  channels. 

In a k-shuffle, the receiver lr to which lt is connected is given by

$$ l_r = k \,[\, l_t \bmod (L/k) \,] + \operatorname{trunc}(k\, l_t / L) \tag{3-2} $$

If M is set equal to L/k and N is set equal to k, then lt modulo (L/k) = mt and
trunc(k lt/L) = trunc(lt/M) = nt, so that

$$ l_r = N m_t + n_t \tag{3-3} $$

Comparing Equation (3-3) with the expression for lr in Equation (3-1) for arbitrary values of lt
implies that nt = nr and mt = mr.
Therefore, the k-shuffle is equivalent to an (L/k) x k transpose. Note that an output switch in a
k-shuffle OTIS-MIN corresponds to a major index region and that the number of major regions in
the source plane is equal to k. An interesting application of OTIS-MIN arises when k = (MN)^(1/2),
since the number of stages for routing becomes a constant:

$$ \log_k L = \log_{\sqrt{MN}} MN = 2 \tag{3-4} $$

In  this  case,  only  two  stages  of  optoelectronic  switches  and  one  stage  of  optics  are  required  to 
perform  full  routing  between  source  and  destination  planes,  independent  of  the  total  number  of 
communication  channels  in  the  MIN. 
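
The equivalence between the k-shuffle and the (L/k) x k transpose can be checked exhaustively for a small case. The sketch below is illustrative: the function names are hypothetical, k is assumed to divide L, and `min_stages` counts stages with integer arithmetic rather than a floating-point logarithm.

```python
def k_shuffle(l_t, L, k):
    """Receiver index reached from transmitter l_t in a k-shuffle of L channels."""
    return k * (l_t % (L // k)) + (k * l_t) // L


def transpose(l_t, M, N):
    """Receiver index for the M x N transpose: l_r = N * m_t + n_t."""
    n_t, m_t = divmod(l_t, M)
    return N * m_t + n_t


def min_stages(L, k):
    """Smallest s with k**s >= L, equal to log_k(L) when L is a power of k."""
    stages, reach = 0, 1
    while reach < L:
        reach *= k
        stages += 1
    return stages


if __name__ == "__main__":
    L, k = 64, 4
    print(all(k_shuffle(l, L, k) == transpose(l, L // k, k) for l in range(L)))
    print(min_stages(64, 8))  # k = sqrt(L) case
```

With L = 64, choosing k = 8 = √L gives two switching stages, while k = 4 requires three, which illustrates why the k = (MN)^(1/2) case is singled out.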


3.3  OTIS  SYSTEM  ANALYSIS 

The  analysis  of  the  OTIS  optical  system  is  provided  in  this  Section.  The  geometrical  relations 
for  the  optical  system  are  derived  including  lens  positions,  system  length,  and  system  volume. 
The  light  throughput  efficiency  and  the  insertion  losses  of  the  OTIS  optical  system  are  then 
analyzed.  All  these  relations  are  derived  for  the  basic  OTIS  optical  system  and  extended  to  the 
generalized  system  for  arbitrary  optoelectronic  chip  layouts. 


3.3.1  Optical  System  Characterization  and  Derivations 

In  order  to  achieve  the  desired  transformation  the  lenses  must  be  separated  by  a  distance  Dt  for 
the  transmitter  lenses  and  Dr  for  the  receiver  lenses  (see  Fig.  (3-2)).  Both  Dt  and  Dr  can  be  easily 
derived  using  simple  geometric  (similar  triangles)  relations: 


$$ D_t = \left(\frac{\sqrt{MN}-1}{\sqrt{N}+1}\right)\Delta, \qquad D_r = \left(\frac{\sqrt{MN}-1}{\sqrt{M}+1}\right)\Delta \tag{3-5} $$


Note  that  the  lens  spacings  Dt  and  Dr  are  also  the  clear  apertures  of  the  lenses  above  the 
transmitters  and  receivers  planes  respectively.  The  positions  (xi,  yj)  of  the  center  of  the  lenses  in 
these  planes  are  then  given  by: 


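$$ (x_i,\, y_j) = \left( \pm\left(\tfrac{1}{2}+i\right)\frac{\sqrt{MN}-1}{\sqrt{N}+1}\,\Delta \;,\; \pm\left(\tfrac{1}{2}+j\right)\frac{\sqrt{MN}-1}{\sqrt{N}+1}\,\Delta \right) \tag{3-6} $$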


where i, j = 0, ..., √N/2 − 1. In the case of a system as shown in Fig. (3-2), these would be the coordinates in the first lens (transmitter lenses) plane. Those for the second plane (receiver lenses) can be derived from Eq. (3-6) by interchanging N and M. It can also be shown that the magnification of the transmitter lenses (λ) must be the inverse of the magnification of the receiver lenses for a unit system magnification:

$$ \lambda = \frac{d_2}{d_1} = \frac{d_3}{d_4} = \frac{\sqrt{MN}-1}{\sqrt{M}+\sqrt{N}+2} \tag{3-7} $$


This  leads  to  the  following  relation  between  the  focal  lengths  ft  and  fr  of  the  transmitter  and 
receiver  lenses  respectively: 


$$ f_r = \frac{\sqrt{N}+1}{\sqrt{M}+1}\, f_t \tag{3-8} $$


By  combining  Eq.  (3-5)  and  Eq.  (3-8)  it  can  be  shown  that  the  f-numbers  (f#)  of  the 
transmitter  and  receiver  lenses  have  to  be  the  same.  It  is  then  possible  to  derive  the  total  length  (d) 
of  the  system  as  a  function  of  the  system  size  (number  of  channels),  the  source  spacing  in  the 
transmitter  plane  and  the  f-number  of  the  lenses  used  in  the  system: 

$$ d = (\sqrt{N}+1)(\sqrt{M}+1)\,\Delta\, f_{\#} \tag{3-9} $$


The  aspect  ratio  (AR)  of  the  system  can  also  be  estimated.  It  is  equal  to  the  ratio  of  the  system 
length  to  the  transmitter  (receiver)  plane  width: 


$$ AR = \left(1 + \frac{1}{\sqrt{M}} + \frac{1}{\sqrt{N}} + \frac{1}{\sqrt{MN}}\right) f_{\#} \tag{3-10} $$


It is interesting to notice that the aspect ratio of the OTIS system is asymptotically equal to the f-number of the lenses of the system. Finally, the volume of the system can be approximated by:

$$ V = (NM)^{3/2}\,\Delta^3\, f_{\#} \tag{3-11} $$


These  equations  fully  characterize  the  geometrical  behavior  of  the  OTIS  optical  system, 
assuring  that  the  lenses  perform  the  desired  transformation.  This  optical  system  is  advantageous  in 
terms  of  complexity  and  alignment  since  only  two  planes  of  optical  elements  are  required  to 
interconnect  two  optoelectronic  chips.  Power  considerations  such  as  insertion  losses,  system 
efficiency  and  crosstalk  are  discussed  in  the  next  Section. 
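
The geometrical relations of this section can be collected in a small calculator. The sketch below is a minimal illustration, assuming the relations read as D_t = (√(MN) − 1)Δ/(√N + 1), D_r = (√(MN) − 1)Δ/(√M + 1), λ = (√(MN) − 1)/(√M + √N + 2), f_r = f_t(√N + 1)/(√M + 1), d = (√N + 1)(√M + 1)Δ f#, and V ≈ (NM)^(3/2) Δ³ f#. The function name, dictionary keys, and example values are hypothetical.

```python
import math


def otis_geometry(M: int, N: int, delta: float, f_number: float) -> dict:
    """Basic OTIS geometry for node spacing `delta` and lens f-number `f_number`."""
    rM, rN, rMN = math.sqrt(M), math.sqrt(N), math.sqrt(M * N)
    D_t = (rMN - 1) / (rN + 1) * delta   # transmitter lens spacing (Eq. 3-5)
    D_r = (rMN - 1) / (rM + 1) * delta   # receiver lens spacing (Eq. 3-5)
    f_t = f_number * D_t                 # lens aperture equals lens spacing
    f_r = (rN + 1) / (rM + 1) * f_t      # Eq. 3-8
    return {
        "D_t": D_t,
        "D_r": D_r,
        "magnification": (rMN - 1) / (rM + rN + 2),                    # Eq. 3-7
        "f_t": f_t,
        "f_r": f_r,
        "length": (rN + 1) * (rM + 1) * delta * f_number,             # Eq. 3-9
        "aspect_ratio": (1 + 1 / rM + 1 / rN + 1 / rMN) * f_number,   # Eq. 3-10
        "volume": (N * M) ** 1.5 * delta ** 3 * f_number,             # Eq. 3-11
    }


# Hypothetical example: 16 x 16 system, 100 um node spacing, f/4 lenses.
for name, value in otis_geometry(M=16, N=16, delta=100.0, f_number=4.0).items():
    print(f"{name}: {value:g}")
```

Dividing f_r by D_r returns the same f-number as f_t divided by D_t, which is the matching condition stated in the text.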


3.3.2  System  Power  Considerations 

As  can  be  seen  in  Fig.  (3-2),  the  OTIS  optical  system  does  not  achieve  100%  light  efficiency. 
This  is  because  the  light  sources  of  one  given  major  pattern  in  the  transmitter  plane  do  not 
illuminate  only  their  dedicated  lens.  Some  of  the  emitted  light  is  lost  on  the  edges  of  the  chips 
where  there  is  no  transmitter  lens  and  some  other  illuminates  the  neighboring  transmitter  lenses 
creating  unwanted  images  (spots)  in  the  receiver  plane.  However,  it  can  be  shown  by  ray  tracing 
that these unwanted images are not a direct source of crosstalk (Ref. 3-2) since they always lie outside the chip area in the receiver plane.

In the following, it is assumed that all transmitters in a given major pattern fully illuminate their dedicated lens and that the transmitters located at the edge of the transmitter plane do not over-illuminate their dedicated lens (Fig. 3-3). It is then possible to estimate the worst-case collection efficiency (η_wc) of this lens system by calculating a simple area ratio. η_wc can be approximated by the ratio of the light emitted by an edge transmitter and captured by its dedicated lens to the total area illuminated by this transmitter at the lens plane:

$$ \eta_{wc} \approx \left(\frac{D_t}{3D_t}\right)^2 = 11.1\% \tag{3-12} $$

Another  concern  for  such  a  system  is  the  coupling  light  efficiency  between  the  transmitter  lens 
plane  and  the  receiver  lens  plane.  It  turns  out  that  in  the  OTIS  system  there  will  be  no  coupling 
losses  (in  first  approximation).  This  is  due  to  the  design  constraint  that  the  f-numbers  of  the 
transmitter  and  receiver  lenses  have  to  be  matched  and  that  all  the  light  transmitted  by  a  transmitter 
lens  will  illuminate  exactly  the  corresponding  receiver  lens.  Therefore,  if  reflection  losses  and 
diffraction  losses  at  the  receiver  lenses  and  at  the  detectors  are  neglected,  the  worst-case  light 
efficiency  (for  an  edge  modulator)  of  the  system  will  always  be  11.1  %. 


Figure  3-3:  Off-axis  illumination  and  vignetting  for  a  light  transmitter  located  at  the  edge  of 
the  array 

The  optical  implementation  of  OTIS  can  be  achieved  with  either  refractive  or  diffractive  optics. 
Utilization of diffractive optics in the system will decrease the light efficiency by a factor that depends on the number of phase levels used in the diffractive optics fabrication (Ref. 3-3). Furthermore, some of the incident light on the diffractive elements will diffract into higher order terms, thereby affecting the Signal-to-Noise Ratio (SNR) of the system. However, these effects will not be developed under this program since they have already been studied elsewhere (Refs. 3-4, 3-5).


3.3.3  Generalized  System

The  OTIS  optical  system  described  in  the  previous  sections  assumes  that  the  transmitters  and 
the receivers on the optoelectronic integrated circuits lie on a regular square grid and that all the
nodes  of  that  grid  are  used.  In  practice,  due  to  the  electronic  layout,  it  could  prove  useful  for  the 
optical  system  to  be  able  to  accommodate  optoelectronic  chips  with  different  arrangements. 



Figure  3-4.  OTIS  system  for  accommodating  arbitrary  optoelectronic  layouts. 

Figure  3-4a  shows  one  possible  case  where  there  is  a  gap  between  the  major  patterns  in  both 
the  transmitter  (Ct)  and  receiver  (Cr)  planes.  These  gaps  can  be  used  for  wiring  electrical  inputs 
and  outputs  on  and  off  the  chips,  for  bringing  power  lines  to  the  different  locations  on  the  chips,  or 
to accommodate the fact that the optoelectronic transmitter and receiver planes are built using multi-chip module technology. An addition to that case is shown in Fig. 3-4b, where the spacing between nodes in the transmitter plane (Δt) is different from the spacing between nodes in the receiver plane (Δr). As in the previous section, it can be shown that the spacing between the transmitter lenses (D_t) and the receiver lenses (D_r) is:

Calculations similar to those in the previous section lead to the following relation between the f-numbers of the transmitter lenses (f#t) and the receiver lenses (f#r):

$$ \frac{f_{\#t}}{f_{\#r}} = \frac{(\sqrt{M}+C_t+1)\left[1+\frac{\Delta_r}{\Delta_t}\,(\sqrt{N}+C_r)\right]}{(\sqrt{N}+C_r+1)\left[\sqrt{M}+C_t+\frac{\Delta_r}{\Delta_t}\right]} \tag{3-14} $$

It is interesting to note that the f-numbers of the two lens planes differ only when the node spacings in the transmitter and receiver planes are different (Δt ≠ Δr). However, the overall magnification of the generalized system can still be shown to be 1, while the magnification of the first lenslet array (λ) is:

$$ \lambda = \frac{d_2}{d_1} = \frac{d_3}{d_4} = \ldots \tag{3-15} $$


The  length  of  the  system  can  be  derived  from  Eq.  (3-13),  Eq.  (3-14),  and  Eq.  (3-15),  as  a 
function  of  f#t  or  f#r: 


$$ d = (\sqrt{M}+C_t+1)\left[1+\frac{\Delta_r}{\Delta_t}\,(\sqrt{N}+C_r)\right]\Delta_t\, f_{\#t} $$

or

$$ d = \ldots\, f_{\#r} \tag{3-16} $$


If Δt = Δr, the two relations in Eq. (3-16) are identical since the f-numbers are equal. It can be
shown  that  the  volume  of  the  system  still  grows  linearly  with  the  number  of  channels  in  the  system 
and  the  node  spacing: 

$$ V \approx (NM)^{3/2}\,\Delta_t^{3}\, f_{\#t} \tag{3-17} $$




The  last  metric  of  interest  is  the  light  collection  efficiency  as  introduced  in  the  previous  section. 
It  can  be  shown  that  the  worst-case  efficiency  for  the  generalized  system  is: 


(VN-1) 

VM  +  Ct  +  ^ 
(3-18)


By  simply  trading  in  chip  area  (increased  receiver  spacing  or  “gaps”  between  major  patterns),  the 
light  collection  efficiency  of  the  system  can  be  improved. 

In the following sections, in order to simplify without loss of generality, M and N will be set equal (symmetrical system) and the product L = MN = M² = N² will be assumed to be a power of 4. Note that this type of system is particularly useful in the case of k-shuffle applications with k = M = N.
In  this  case,  the  number  of  stages  for  routing  or  sorting  of  the  k-shuffle  multistage  interconnection 
network  becomes  a  constant  independent  of  L  (L  =  MN  is  the  size  or  number  of  channels  of  the 
network).  For  routing  applications,  only  2  electronic  switching  stages  and  one  optical 
interconnection  stage  will  be  required  to  achieve  full  connectivity  for  any  L. 

3.3.4  Folded  System 

The  previously  described  systems  perform  the  transpose  interconnection  but  leave  the 
transposed  result  on  a  different  array  than  the  original  one.  In  some  cases,  it  can  be  desirable  to 
have  both  the  original  matrix  and  its  transpose  on  the  same  array.  Figure  (3-5)  shows  a  possible 
optical  system  that  achieves  the  matrix  transposition  on  a  single  optoelectronic  chip.  The  light 
emitted  by  the  transmitters  is  first  imaged  through  the  OTIS  lenses  onto  an  intermediate  image 
plane  shown  by  a  dashed  line  on  Fig.  (3-5).  A  single  imaging  lens  and  a  folding  mirror  are  then 
used  in  a  one-to-one,  4f  configuration  to  flip  this  intermediate  image  back  onto  itself.  Finally,  the 
light  goes  through  the  OTIS  lenses  again  to  the  detectors  on  the  chip  in  order  to  complete  the 
transpose  operation. 

The  interconnection  function  achieved  in  this  case  is  exactly  that  of  a  matrix  transpose  operation 
without  any  image  inversion  or  rotation,  leaving  the  elements  of  the  first  diagonal  of  the  matrix 
unchanged  and  interchanging  all  the  other  elements  of  that  matrix  with  respect  to  the  first  diagonal. 
It  is  important  to  note  that  only  symmetrical  systems  (N  =  M)  can  be  folded  in  this  manner.  It  can 
be  seen  on  Fig.  (3-5)  that  the  apertures  of  the  OTIS  lenses  in  the  folded  configuration  are  no 
longer  symmetrical  with  respect  to  their  center  and  that  the  area  between  the  lenses  is  actually  used 
to  block  the  incident  or  reflected  light.  This  is  required  to  achieve  the  transpose  interconnection 
operation  without  any  crosstalk  between  adjacent  communication  channels.  The  size  of  these  beam 
blocks (bb) can be easily calculated and reduces to bb = λδ, where λ is the magnification of the OTIS transmitter lenses and δ is the spacing between modulators and detectors on the transmitter/receiver plane.

This  configuration  reduces  the  light  collection  efficiency  and  the  light  coupling  efficiency  of  the 
system  since  the  effective  apertures  of  the  OTIS  lenses  are  reduced  and  light  is  blocked  on  both  the 
forward  path  from  the  transmitters  to  the  mirror  and  the  reflected  path  from  the  mirror  to  the 
receivers.  In  this  case  the  worst-case  light  efficiency  of  the  system  (Eq.  (3-12))  reduces  to: 


$$ \eta_{wc} = \left(\frac{D_t - 2\,bb}{3D_t + \delta}\right)^2 \tag{3-19} $$


where D_t is the spacing between the centers of adjacent lenses. In order to minimize light losses in such a folded system, the receivers and transmitters should be laid out as close as possible to each other on the optoelectronic chip; i.e., δ should be as small as physically possible compared to Δt.

For multistage interconnection network applications, the folded configuration uses only a single optoelectronic plane and a single optical interconnection plane, thereby reducing the amount of hardware in the system. However, such issues as buffering, time-multiplexing, and pipelining of the data become important and must be evaluated (Ref. 3-6) before a single-chip implementation can be envisaged.
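
To get a feel for how much the beam blocks cost, the worst-case efficiency of the folded system can be compared with the 11.1% of the unfolded system. This is a minimal sketch, assuming the folded efficiency reads η_wc = ((D_t − 2 bb)/(3 D_t + δ))² with bb = λδ; the function name and the numerical values are hypothetical.

```python
ETA_BASIC = (1.0 / 3.0) ** 2  # unfolded worst case, (D_t / 3 D_t)^2


def eta_folded(D_t: float, magnification: float, delta: float) -> float:
    """Worst-case efficiency of the folded system for lens spacing D_t,
    transmitter-lens magnification, and modulator/detector spacing delta."""
    bb = magnification * delta            # beam block size, bb = lambda * delta
    if 2 * bb >= D_t:
        raise ValueError("beam blocks cover the whole lens aperture")
    return ((D_t - 2 * bb) / (3 * D_t + delta)) ** 2


# Hypothetical example: 300 um lens spacing, magnification 1.5.
for delta in (0.0, 5.0, 10.0, 20.0):
    print(f"delta = {delta:5.1f} um  ->  eta_wc = {eta_folded(300.0, 1.5, delta):.4f}")
print(f"unfolded reference: {ETA_BASIC:.4f}")
```

With δ = 0 the expression reduces to the unfolded value, and the efficiency falls monotonically as δ grows, which is why the text asks for δ to be small compared to the node spacing.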


[Figure 3-5 labels: optoelectronic chip with transmitters and receivers, intermediate image plane, 1:1 imaging lens, mirror]

Figure  3-5.  Folded  (reflective)  OTIS  configuration. 

3.3.5  Bi-directional  System

An  important  feature  of  the  Optical  Transpose  Interconnection  System  described  in  the  previous 
sections  is  that  it  can  be  made  bi-directional  easily.  This  could  prove  useful  in  a  number  of 
applications.  For  interconnection  networks  in  a  packet  switching  configuration  it  allows  blocking 
information  to  be  sent  back  to  the  previous  stages  of  the  network  in  order  to  establish  the  correct 
routing  path  in  the  network.  OTIS  can  also  be  used  in  a  matrix-vector  multiplier  configuration 
(OTIS-MT)  in  support  of  neural  network  applications.  In  this  case,  the  bi-directionality 
characteristic  of  OTIS  can  be  used  to  implement  an  on-chip  fully  parallel  back  propagation  learning 
system. 

The  implementation  of  the  bi-directional  OTIS  optical  system  is  shown  on  Fig.  (3-6).  Plane  1 
and  Plane  2  are  now  both  transmitter/receiver  planes.  Plane  2  is  actually  rotated  180°  with  respect 
to  plane  1  so  that  a  transmitter  on  plane  1  faces  a  receiver  on  plane  2  and  vice-versa.  Similarities 
with  the  folded  system  from  Fig.  (3-5)  can  be  observed  in  the  fact  that  beam  blocks  between 
adjacent  lenses  are  required  in  order  to  avoid  crosstalk.  These  beam  blocks  are  located  on  the  lower 
half of the transmitter lenses and the upper half of the receiver lenses. As described in the previous section, the light collection efficiency and the light coupling efficiency are reduced. It can be proven that the beam blocks in this case are the same size as in the folded system case and, therefore, the worst-case light efficiency of the system is also given by Eq. (3-19).

3.3.6  Multi-channel  System 

A final interesting feature of the OTIS system is that it can become a multi-channel system where each node has more than a single transmitter/receiver pair. This is useful where more than bit-serial communication is required in a system. For example, word-parallel communication with 16-bit words could be required; in this case, each node would have 4x4 transmitter/receiver pairs. It could also be useful in interconnection networks where one transmitter/receiver pair is used for data communication and another for activation/blocking information; in this case, 2 transmitter/receiver pairs are required within each node. Finally, it could also prove useful to have 2 transmitter/receiver pairs for each communication channel in order to implement differential readout (dual-rail logic) where the intrinsic signal-to-noise ratio of the system is not sufficient to provide low communication bit error rates.

[Figure 3-6 labels: plane 1, plane 2, transmitter, receiver]

Figure  3-6:  Bi-directional  OTIS 


Figure  3-7  shows  a  1-D  representation  of  a  multi-channel  system  with  2x2  transmitter/receiver 
pairs  within  each  node.  As  before,  plane  2  is  actually  rotated  180°  with  respect  to  plane  1  so  that  a 
transmitter  on  plane  1  faces  a  receiver  on  plane  2  and  vice-versa  (Note  that  the  multi-channel 
system  is  also  bi-directional).  In  this  case,  beam  blocks  are  also  required  around  the  lenses  to 
achieve  the  transpose  interconnection  without  any  crosstalk.  The  size  of  these  beam  blocks  can  be 
easily  derived  and  reduces  to: 

$$ bb = \lambda\,(2K - c)\,\delta \tag{3-20} $$


where λ is the magnification of the transmitter lenses, K is the number of transmitter/receiver pairs within each node, δ is the spacing between adjacent transmitter and receiver, and c is a constant that can take two different values (see below). Two beam blocks of different sizes, located at the top and the bottom of the OTIS lenses, are needed. Which beam block is used at the top or bottom of the lenses changes from the plane 1 lenses to the plane 2 lenses (see Fig. 3-7). For the plane 1 lenses (on the left in Fig. 3-7), bb can be calculated with c = 1 for the beam blocks at the top of each lens and c = 3 for those at the bottom. In the second lens plane the beam block positions are inverted for the same values of c. In this configuration, the light efficiency of the system is also reduced and can be calculated to be:


$$ \eta_{wc} = \left(\frac{D_t - 2\,bb_1 - bb_2}{3D_t + (2K-1)\,\delta}\right)^2 \tag{3-21} $$


As  previously,  it  can  be  seen  that  it  is  important  to  have  the  transmitters  and  detectors  located 
toward  the  centers  of  each  node  (minimize  all  the  terms  of  Eq.  (3-21)  except  Dt)  in  order  to 
optimize  (maximize)  the  worst-case  light  efficiency  of  the  system. 


Figure  3-7.  Multi-channel  OTIS  implementation. 

3.3.7  OTIS  Issues

The  analysis  of  the  OTIS  system  raises  some  important  issues.  First,  the  insertion  losses  are 
too  high  and  must  be  reduced.  Second,  the  use  of  a  Polarizing  Beam  Splitter  (PBS)  will  create 
problems  in  terms  of  uniformity  and  crosstalk. 

3.3.7.1  Array  Illuminator

As  shown  in  Section  3.3.2,  the  collection  efficiency  of  the  system  can  be  as  low  as  10%.  This 
is  not  acceptable  and  improvements  have  to  be  made.  A  possible  improvement  is  shown  in  Fig.  3-8 
where  all  the  modulators  in  the  system  are  illuminated  at  the  correct  angle  so  that  all  the  light  that 
passes through a modulator will be collected by the correct OTIS lens. Such issues as the effect of non-normal incidence on the modulators, crosstalk, and others have been studied and are reported in the final report. Note that this system is also ideally suited for implementation with the Binary
Computer  Generated  Hologram  (BCGH)  technology  for  reflective  type  modulators. 


Figure  3-8.  Area  multiplexed  illumination  lenslets  for  the  OTIS. 

3.3.7.2  Beamsplitter

Another  problem  in  the  optical  system  design  is  that  a  PBS  is  used  for  separating  the  modulator 
illumination  path  from  the  interconnect  path.  In  the  interconnect  path,  this  will  create  some 
crosstalk  and  efficiency  problems  since  the  acceptance  angle  of  a  PBS  is  very  limited  (±5°  at  best 
even  for  a  narrow  band  PBS).  Therefore  we  have  investigated  alternative  solutions:  wire-grid 
polarizers  and  volume  holograms. 

A wire-grid polarizer could be considered as a replacement for the polarizing beamsplitter in the optical interconnection module. The role of the wire-grid polarizer in such a module would be to bring the linearly polarized illumination beam (at a certain direction) to the light modulators without interrupting the interconnection path. This is possible because a wire-grid polarizer is virtually transparent to the orthogonally polarized light. To determine whether a wire-grid polarizer is suitable for such a use, its transmission and polarization characteristics, especially the off-axis ones, should be examined. The analysis given below follows directly from a generalized model for wire-grid polarizers introduced by Yeh (Ref. 3-7): a wire-grid polarizer consists of parallel wires which reflect one polarization of incident electromagnetic radiation while transmitting the other, provided the period of the grid is smaller than the wavelength of the radiation.

In  the  generalized  optical  model  developed  by  Yeh,  a  wire-grid  polarizer  is  considered  as  a  thin 
sheet  of  composite  medium  consisting  of  parallel  slabs  (or  cylinders)  of  absorbing  material  (e.g., 
metal)  immersed  in  an  isotropic  base  medium.  If  the  dimension  of  the  wires  and  the  spacing 
between  them  are  sufficiently  small  compared  to  the  wavelength  of  the  electromagnetic  radiation, 
the  whole  structure  behaves  like  a  homogeneous  and  uniaxially  anisotropic  medium.  If  the  indices 
of refraction of the absorbing medium (wire) and the base medium are n_a and n_b, respectively, then the effective indices of the composite structure consisting of parallel slabs are given by (Ref. 3-8):


$$ n_\parallel^2 = v_a n_a^2 + v_b n_b^2, \qquad n_\perp^2 = n_b^2 + \frac{v_a\,(n_a^2 - n_b^2)}{1 + v_b\,(n_a^2 - n_b^2)/n_b^2} \tag{3-22} $$

where v_a is the fraction of volume occupied by the absorbing medium and v_b = 1 − v_a. These equations are known as Wiener's equations of form birefringence.

A wire-grid polarizer made of a good conductor behaves as a good conducting metal layer with an effective index of refraction n_∥ for incident radiation with electric field vector parallel to the wire
grids  and  reflects  most  of  the  incident  radiation  if  the  layer  is  thick  enough.  For  incident  radiation 
with  electric  field  vector  perpendicular  to  the  wire  grids,  the  polarizer  behaves  as  a  dielectric  layer 
and  reflects  no  light  provided  the  polarizer  is  properly  anti-reflection  coated.  This  assumption  is 
true  for  the  extreme  case  when  the  wire  grids  are  made  of  perfect  conductors  and  the  base  medium 
is  pure  dielectric,  in  which  case  the  refractive  indices  given  by  Equation  (3-22)  become: 


$$ n_\parallel = \sqrt{v_a}\; n_a, \qquad n_\perp = \sqrt{\frac{v_a + v_b}{v_b}}\; n_b \tag{3-23} $$


since |n_a²| ≫ n_b². The condition |n_a²| ≫ n_b² holds for metals such as aluminum and silver in the infrared spectral regime. At λ = 10 μm, |n_a²| = 5200 and 4800 for aluminum and silver, respectively (Ref. 3-9). Therefore, if both modes are excited at the input face of the polarizer, the incident radiation with electric field vector parallel to the wire grids sees a complex refractive index n_∥ with a large imaginary part and is attenuated strongly, while the radiation with electric field vector perpendicular to the wire grids sees a real refractive index n_⊥ and propagates freely in the polarizer.

However, this is not completely true in the visible spectral regime. At λ = 0.55 μm, |n_a²| = 36 and 10 for aluminum and silver, respectively. Hence, n_⊥ is not a real number but has an imaginary part, as given by Eq. (3-22). Therefore the radiation with electric field vector perpendicular to the wire grids does not propagate freely but experiences attenuation. On the other hand, n_∥ no longer has a very large imaginary part, and the radiation with electric field vector parallel to the wire grids is not attenuated strongly. These conditions may severely affect the transmission and polarization characteristics of the wire-grid polarizer in the visible spectral regime. The polarization selectivity can be increased by increasing the volume of the metallic layer. This, however, decreases transmission.
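
The contrast between the infrared and visible regimes can be illustrated by evaluating the form-birefringence relations with complex indices. The sketch below assumes Eq. (3-22) reads n_∥² = v_a n_a² + v_b n_b² and n_⊥² = n_b² + v_a(n_a² − n_b²)/(1 + v_b(n_a² − n_b²)/n_b²). The metal indices are hypothetical values chosen only so that |n_a²| is close to the magnitudes quoted in the text (about 5200 and 36); they are not measured optical constants, and the fill fraction and base index are likewise assumptions.

```python
import cmath


def wiener_indices(n_a: complex, n_b: complex, v_a: float) -> tuple:
    """Effective indices (n_parallel, n_perpendicular) of a slab composite."""
    v_b = 1.0 - v_a
    ea, eb = n_a ** 2, n_b ** 2
    n_par_sq = v_a * ea + v_b * eb
    n_perp_sq = eb + v_a * (ea - eb) / (1 + v_b * (ea - eb) / eb)
    return cmath.sqrt(n_par_sq), cmath.sqrt(n_perp_sq)


N_B = 1.5    # assumed dielectric base medium
V_A = 0.5    # assumed metal fill fraction

cases = {
    "infrared-like (|n_a^2| ~ 5200)": 20 + 69.3j,  # hypothetical index
    "visible-like  (|n_a^2| ~ 36)": 1 + 5.9j,      # hypothetical index
}
for label, n_a in cases.items():
    n_par, n_perp = wiener_indices(n_a, N_B, V_A)
    print(label)
    print(f"  n_parallel      = {n_par:.3f}")
    print(f"  n_perpendicular = {n_perp:.4f}")
    print(f"  perfect-conductor limit n_b / sqrt(v_b) = {N_B / (1 - V_A) ** 0.5:.4f}")
```

In the infrared-like case n_⊥ stays very close to the real limiting value of Eq. (3-23), while in the visible-like case it picks up a noticeably larger imaginary part, which is the attenuation described above.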

To get an idea of the polarization characteristics, we can consider the wire-grid polarizers developed by Texas Instruments, Inc. (Ref. 3-10) using a vacuum evaporation technique. Their low-density polarizer had a degree of polarization of about 0.88 and a transmission of 0.40 at 0.8 μm. A higher-density one had a degree of polarization of 0.99 and a transmission of 0.23 at 0.85 μm. The performance was much poorer at 0.5 μm. In addition, these were the characteristics at normal incidence.

To  determine  the  polarization  properties  of  a  wire-grid  polarizer  in  the  case  of  off-axis  incident 
radiation,  it  is  sufficient  to  consider  the  relation  between  the  fields  incident  on  a  thin  uniaxially 
anisotropic  medium  and  the  fields  transmitted  through  it.  Consider  the  figure  below: 


Transmission  of  light  through  a  uniaxial  plate  at  off-axis  incidence  can  be  written  in  the  matrix 
form,  ignoring  the  multiple  reflections  in  the  plate,  as  : 
where t_so, t_po, t_se, t_pe, t_os, t_es, t_op, and t_ep are the transmission coefficients, and k_oz and k_ez are the z components of the wave vectors k_o and k_e, respectively. This expression relates the transmitted wave amplitudes A_s' and A_p' to the incident wave amplitudes A_s and A_p. The definitions of the transmission coefficients and the wave vectors have been omitted here and can be found in Yeh. However, it can easily be seen that the expressions for A_s' and A_p' each contain both A_s and A_p terms. In other words, even if one of the orthogonal polarization components of the incident radiation is zero, both orthogonal polarization components will be excited at the output. We conclude that a wire-grid polarizer does not provide an advantage over a beamsplitter cube when the angular field properties are considered.
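
For orientation, a matrix relation built from the coefficients named above would plausibly take the form of a product of an input-interface matrix, a propagation matrix for the ordinary and extraordinary modes, and an output-interface matrix. The expression below is an assumption assembled from the coefficient names in the text, not a recovery of the original equation; the plate thickness symbol h and the sign convention of the phase factors are hypothetical.

```latex
\begin{pmatrix} A_s' \\ A_p' \end{pmatrix}
=
\begin{pmatrix} t_{os} & t_{es} \\ t_{op} & t_{ep} \end{pmatrix}
\begin{pmatrix} e^{-i k_{oz} h} & 0 \\ 0 & e^{-i k_{ez} h} \end{pmatrix}
\begin{pmatrix} t_{so} & t_{po} \\ t_{se} & t_{pe} \end{pmatrix}
\begin{pmatrix} A_s \\ A_p \end{pmatrix}
```

Multiplying out such a product gives, for example, a coefficient of A_p in A_s' equal to t_os t_po e^{-i k_oz h} + t_es t_pe e^{-i k_ez h}, which does not vanish in general at off-axis incidence. This is the polarization mixing the paragraph above refers to.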

Since wire-grid polarizers do not offer significant advantages over PBSs, we are now investigating the possibility of using a volume hologram (and its Bragg selectivity) to differentiate between the illumination and interconnect optical paths, as depicted in Figure 3-9. The volume hologram is recorded by interfering two orthogonal plane waves in the holographic material. When read out by a plane wave (power in), the grating couples the light into another plane wave (illumination), which is used by the illumination lenslet array to create a spot array on the modulators. When the light is reflected by the modulators it hits the OTIS lenslets. Due to the
topology  of  the  OTIS  interconnect,  no  significant  plane  wave  component  is  created  in  the 
interconnection  path  and  therefore  all  the  light  is  coupled  to  the  other  OTIS  lenslet  array  and 
focused  down  onto  the  detectors. 


Figure  3-9.  Volume  holographic  beamsplitter  for  the  OTIS. 

Since a single hologram is recorded in the holographic medium, it can have very high efficiency. Also, the hologram can be fixed off-line (using, for example, thermal fixing in LiNbO3) and then used in the system. Such issues as recording uniformity, crosstalk, packaging, optimal thickness, and optimal holographic material have been studied and have shown favorable results.


3.4  OTIS  DESIGN  VERIFICATION

To  demonstrate  the  OTIS  concept  an  experimental  system  has  been  designed  by  implementing  a 
one  stage  64x64  (NxM)  symmetrical  (N  =  M)  OTIS  optical  system.  The  system  consists  of  a 
64x64  pinhole  array  (the  input  plane)  and  its  associated  illumination  lenslet  array,  two  refractive 
lenslet  arrays  for  the  interconnection,  and  a  CCD  camera  as  the  output  plane  detector  (shown  in 
Figure  3-10). 

Figure  3-10.  Experimental  setup  for  the  64x64  OTIS  demonstration.


The 64x64 pinhole array was fabricated using our electron beam lithography system. The pinholes were obtained via chrome-etch on a glass plate and have a 5x5 μm square aperture. They are spaced 57 μm apart, for a total input plane size of 3.64 mm. As illustrated in Figure 3-11, the pattern formed by the input pinholes represents the letters OT. The illumination lenslets have a 190 μm focal length with an aperture of 57 μm (f/# ≈ 3.3) that matches the pinhole spacing and array format. The illumination lenslets were also fabricated using the electron beam lithography system, with a minimum feature size of 1.87 μm. However, since power was not an issue in this experiment, the lenses were kept as binary amplitude elements with a simple fabrication step of chrome-etch on glass. If light efficiency were critical, the lenses could easily be fabricated as multi-level phase elements. The interconnection lenslet arrays were purchased from Adaptive Optics Associates and were fabricated in stamped epoxy on a glass microscope slide. The lenslet array size is 20x32 and each lenslet has a 399 μm aperture for a focal length of 3.12 mm, leading to an f/# of approximately 8. As expected, the system achieves a one-to-one imaging transformation and the output plane pattern (see Figure 3-11) consists of the OT letters reduced in size by a factor of 8 and replicated 8x8 times over the output plane. An optimized design of this experimental system is currently under investigation. The optimization will be based on aspheric diffractive optical elements designed with the Code V (Ref. 3-11) optical system design software.
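
The quoted experimental parameters can be cross-checked against the design relations of Section 3.3.1. The sketch below is illustrative and assumes the lens spacing follows D_t = (√(MN) − 1)Δ/(√N + 1) and the system length follows d = (√N + 1)(√M + 1)Δ f#; the length it prints is an estimate under that assumption, not a figure reported in the text.

```python
import math

# Values quoted in the text (lengths in micrometres).
N = M = 64                                   # symmetrical 64x64 system
PITCH = 57.0                                 # pinhole spacing
ILLUM_FOCAL, ILLUM_APERTURE = 190.0, 57.0    # illumination lenslets
LENS_FOCAL, LENS_APERTURE = 3120.0, 399.0    # interconnection lenslets

side = math.sqrt(M * N)                      # pinholes per side of the input plane
plane_width_mm = side * PITCH / 1000.0
illum_f_number = ILLUM_FOCAL / ILLUM_APERTURE
lens_f_number = LENS_FOCAL / LENS_APERTURE

# Lens spacing predicted by the similar-triangle relation for this system.
predicted_pitch = (side - 1) / (math.sqrt(N) + 1) * PITCH
# System length estimated from the length relation (an assumption, see lead-in).
length_mm = (math.sqrt(N) + 1) * (math.sqrt(M) + 1) * PITCH * lens_f_number / 1000.0

print(f"input plane width    : {plane_width_mm:.3f} mm")
print(f"illumination f/#     : {illum_f_number:.2f}")
print(f"interconnection f/#  : {lens_f_number:.2f}")
print(f"predicted lens pitch : {predicted_pitch:.1f} um (quoted aperture {LENS_APERTURE:.0f} um)")
print(f"estimated length     : {length_mm:.1f} mm")
```

The predicted lens pitch coincides with the 399 μm aperture of the purchased lenslets, and an 8x8 subset of the 20x32 array is enough to serve the 64 major patterns.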

Figure  3-11.  Experimental  results  for  a  64x64  OTIS  interconnection.


Section  4

SPACE-TIME  COMPANDER 


The addition of an optical interconnect to the parallel locally-connected computer architecture
allows efficient optical image processing and global 2-D operations. The Space-Time
Compander (STC) provides parallel, bi-directional communication between coarse-grain
processor arrays and fine-grain subsystems, such as optical memories, sensors, or optical
co-processors. This parallel interface must allow data to either "expand" or be "compacted" in its
spatial size: from the large fine-grain array size to the smaller coarse-grain processor array size.
The most straightforward approach is to use a buffer array to convert the 2-D spatial (parallel)
information into 1-D time (serial) information. The concept of the space-time compander is
illustrated in Figure 4-1.

The buffer array structure can be implemented with charge coupled device (CCD)
technology, because CCD technology can provide all three functions needed for such a
buffer structure, namely (1) time-sequential serial charge or voltage pattern buffering,
(2) imager technology, and (3) spatial light modulator driver technology.(4-1, 4-2)
Section 4.1 describes the modification of the Hughes CCD-LCLV and the associated optical
systems for the STC application. In Section 4.2, a detailed CCD design analysis for the
CCD-based STC is discussed.

[Figure 4-1 labels: high density optical memory (>1024x1024); space-time compander
(1024^2/128^2); 3-D computing unit; microbridge; readout beam; Si epitaxial layer]

Figure 4-1. Schematic of space-time compander interface.


4.1  CCD-BASED  LIQUID  CRYSTAL  IMAGER/MODULATOR 


To communicate efficiently between a coarse-grain 128x128 parallel processor array and
fine-grain (1024x1024) optical information, the STC is designed to buffer each 8x8 array of the
1024x1024 ports in a superpixel. The superpixel size (448 µm) was chosen to match the size of
the Hughes 3-D computer module. The first design issue is the use of a combined, dual-purpose
modulator/detector array versus separate modulator and detector arrays in each superpixel (as
schematically illustrated in Figure 4-2). The main advantage of the combined approach is a simple
optical system, while its main disadvantage is the compromised performance of both the detector
and modulator arrays. On the other hand, the separate approach requires a complicated optical
system, but the imager and modulator can be individually designed and tuned for their optimum
performance. This design flexibility is necessary only if we plan to deal with analog data with a
large gray scale. For binary data the combined modulator/detector approach is preferred because
it minimizes the optical system complexity. For a given physical 2-D size of the STC superpixel,
the combined modulator/detector approach also provides a larger pixel size and thus a better
modulation transfer function (MTF). To squeeze two 8x8 CCD arrays and two microbridges into a
448 µm superpixel, the CCD pixel size must be greatly reduced and may not be able to support a
50% MTF in the conventional CCD-LCLV. Extensive research efforts would then be required to
improve the resolution. The design for the combined modulator/imager approach is described next.

Figure 4-2. Schematics of one superpixel of the space-time compander: (a) separate modulator
and detector arrays and (b) combined modulator/detector array.
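
The space-to-time mapping performed by each superpixel can be sketched in a few lines. This is a minimal illustration, not the device's actual readout order: the row-major ordering of time slots inside a superpixel and the function names are assumptions introduced here.

```python
# Sketch of the space-time companding map between a 1024x1024 fine-grain
# array and a 128x128 processor array, with an 8x8 buffer per superpixel.
# Assumption: time slots inside a superpixel are numbered in row-major order.

BLOCK = 8                          # fine-grain ports per superpixel side
FINE_SIZE = 1024                   # fine-grain array is 1024x1024
COARSE_SIZE = FINE_SIZE // BLOCK   # processor array side (128)

def fine_to_coarse(row, col):
    """Map a fine-grain port (row, col) to (processor node, serial time slot)."""
    node = (row // BLOCK, col // BLOCK)
    slot = (row % BLOCK) * BLOCK + (col % BLOCK)
    return node, slot

def coarse_to_fine(node, slot):
    """Inverse map: recover the fine-grain port from a node and a time slot."""
    node_row, node_col = node
    return node_row * BLOCK + slot // BLOCK, node_col * BLOCK + slot % BLOCK

# Each processor node serially handles BLOCK * BLOCK = 64 fine-grain ports.
node, slot = fine_to_coarse(1023, 1023)
print(node, slot)
```

In the "compacting" direction every node receives its 64 values one slot at a time; in the "expanding" direction the inverse map places each serial value back at its fine-grain position.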

4.1.1  CCD-LCLV  Based  Space-Time  Compander 

Analysis shows that we can obtain a combined modulator/detector device with a slight
modification of the current Hughes CCD-LCLV structure. By replacing the 5 µm epilayer
currently used on the p-Si substrate for the fabrication of the CCD gate structures with a thin
(about 0.2 µm) p-type sheet implant layer and using a leaky mirror, the CCD-LCLV can also be
used as an area imager. For this purpose, however, the combined device has to be operated at two
different wavelengths for writing into and reading from the high resolution optical element (e.g.,
optical memory). In the write-in phase, the optical beam is reflected from the device and is
modulated by the two-dimensional charge pattern information in the CCD array for addressing the
high resolution element. In the readout phase, the optical information from the high
resolution element is transmitted through a specially designed leaky mirror, producing a
two-dimensional charge pattern which is then read by the CCD array. To obtain high resolution in
the imager it is necessary to deplete the entire Si substrate. The thin p-type implanted layer
allows the entire Si substrate to be depleted by the CCD gate voltage, which would not otherwise
be possible with the currently used p-type epilayer.

One of the important issues in developing the CCD-LCLV based STC involves device
packaging. To communicate directly with the processors, it is necessary to make electrical contact
from each of the STC superpixels to the corresponding node on the processor array. A simple
approach is to treat the CCD-LCLV as a single 3-D wafer, so that it can communicate with the 3-D
computer through the microbridge or metal bump technologies. This approach is simple to
implement, but it lacks precise control over the thickness uniformity of the liquid crystal layer.
Although this approach may suffer from nonuniform output light modulation, it is well suited
for binary data modulation. On the other hand, since both the CCD-LCLV and the parallel processor
array (through D/A converters) are capable of handling analog signals, it may be beneficial to push
for analog devices. For this purpose it will be necessary to control the flatness of the CCD-LCLV
by improving the device packaging techniques.

An optically flat Si substrate is required in the CCD-LCLV in order to obtain a uniform liquid
crystal layer and hence a uniform output light intensity. Currently the Si substrate in the
CCD-LCLV is flattened using our transfer bonding technique,(4-3) in which the Si wafer is first
contact bonded temporarily to an optically flat glass and the exposed side of the Si wafer is then
epoxy bonded to a supporting glass substrate. After the epoxy is cured, the temporary optical flat
is removed and a Si surface as flat as the optical flat is exposed. Consequently, any
nonuniformity in the thickness of the Si substrate is embedded in the epoxy.
Electrical contact can be accomplished by using a glass substrate with conductive feedthroughs
in which an array of holes is first made in the glass and then filled with a silver frit glass. The
silver frit is initially in paste form and is mechanically pushed through the holes. It is then fired
at about 300°C to form a solid conductive glass feedthrough. The holes in a quartz substrate can be
made by laser drilling using a high power CO2 laser. Holes as small as 8-10 mils in diameter,
which fit within the predetermined 448 µm superpixel size, can be made using this technique. It
is also possible to use fiber optic capillary arrays to form the necessary through-substrate holes.

The conductive feedthroughs in the supporting glass substrate can be electrically connected to
the contact pads on the CCD circuit by using the indium bump or flip-chip bonding technique. An
array of indium bumps of about 1 mil in size with a periodicity of 448 µm, corresponding to the
unit supercell size in the 3-D computer, can be evaporated on both the CCD contact pads and the
conductive feedthroughs using a shadow mask before the transfer bonding process. The design
concept of the feedthrough packaging is illustrated in Figure 4-3. Again, this approach is not
critically needed if binary operation is adopted in the system.
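
Because the feedthrough and bump dimensions are quoted in mils while the superpixel pitch is quoted in micrometers, a quick unit conversion helps confirm that the features fit within one superpixel. The sketch below uses only the figures given above; the "clearance" it prints is simply pitch minus hole diameter and is our own illustrative quantity, not a design rule from the program.

```python
# Unit check: do the laser-drilled feedthrough holes and indium bumps fit
# within the 448 um superpixel pitch?  (1 mil = 0.001 inch = 25.4 um.)

UM_PER_MIL = 25.4
SUPERPIXEL_PITCH_UM = 448.0

def mils_to_um(mils):
    """Convert a length in mils to micrometers."""
    return mils * UM_PER_MIL

# Smallest hole diameters quoted in the text: 8 to 10 mils.
hole_diameters_um = [mils_to_um(d) for d in (8, 10)]

# Indium bump size quoted in the text: about 1 mil.
bump_size_um = mils_to_um(1)

for diameter in hole_diameters_um:
    clearance = SUPERPIXEL_PITCH_UM - diameter  # illustrative quantity only
    print(f"hole {diameter:.1f} um, remaining pitch {clearance:.1f} um")
print(f"indium bump {bump_size_um:.1f} um")
```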


[Figure 4-3 label: CCD peripheral contacts]

Figure 4-3. Cross section of the CCD-LCLV with conductive feedthroughs in the supporting
glass.

To make contact between the conductive feedthroughs in the glass support just mentioned and
the corresponding pads of the STC superpixels, we can use a special adhesive which is conductive
only in one direction.(4-4) As a result of embedded gold-coated particles which are compressed
along this direction during curing, this adhesive, which is referred to as the z-axis adhesive,
conducts only in the direction along its thickness. There is no conduction in the lateral directions
because of the low density of these particles. This approach has been verified to be suitable for the
STC assembly.

4.1.2  Mechanical  Design

In  order  to  use  the  CCD  chip  as  an  optical  device  it  is  necessary  to  package  it  so  that  the  back 
surface  of  the  chip  with  the  liquid  crystal  layer  is  exposed.  This  means  that  the  chip  has  to  be 
mounted  in  a  flip-chip  fashion  with  the  CCD  side  in  contact  with  the  substrate.  This  requires  an 
interconnect  method  between  the  CCD  pads  and  the  substrate.  This  is  frequently  done  in  the 
industry  using  a  Z-axis  adhesive  that  conducts  vertically  between  the  substrate  and  the  chip,  but  not 
horizontally  between  lines  or  pads. 

A  design  was  generated  for  the  layout  of  the  substrate.  This  included  analyzing  the  overall  size 
constraints  and  coming  up  with  a  fanout  pattern  that  allowed  as  many  superpixels  as  possible  to  be 
brought  out  to  a  contact  area  around  the  edge.  The  drive  signals  also  had  to  be  brought  in  through 
some  kind  of  detachable  connector.  A  mechanical  design  of  the  mounting  fixture  was  also 
developed  since  it  was  an  integral  part  of  the  system  surrounding  the  support  substrate. 

A  one  inch  square  quartz  plate  was  chosen  as  the  substrate  for  the  compander  chip.  A  fanout 
and  interconnect  pattern  was  designed  that  provided  connection  for  the  drive  signals  along  one 
edge  and  contact  to  88  of  the  superpixels  on  the  other  edges.  The  drive  signals  will  be  connected  to 
the  substrate  using  a  mating  Kapton  ribbon  cable  with  a  pressure  plate  contact.  An  alternative 
fanout  pattern  was  also  designed  that  allowed  contact  to  the  optical  and  electrical  test  devices  on  the 
side  of  the  chip. 

The mechanical design of the mounting fixture was also completed. It was desirable to keep
the size small and the assembly simple. The mounting base plate for the substrate is only one
half inch larger than the substrate and will support the pressure connector for the drive signals. A
counterelectrode holder and pressure plate will also mount directly to the base plate. The whole
assembly can be mounted vertically for testing on an optics bench.

A mechanical design was also developed for depositing the ITO layer on the counter-electrodes
with a contact metallization on the side of the glass. The technique used a shadow mask approach
with an angled deposition for the metal. Glass samples used under the CCD-LCLV project were
found to be compatible with the compander, so it was not necessary to design new pieces.
Drawings for the shadow mask fixture were fabricated.

A  Kapton  ribbon  connector  was  designed  to  provide  signal  contacts  to  the  compander 
substrate.  The  layout  was  designed  using  a  PCB  CAD  program  and  the  artwork  was 
photographically  reduced.  The  pattern  was  contact  printed  onto  the  Kapton  material  and  etched. 
The  copper  traces  were  tin  plated  in  the  contact  area  to  prevent  oxidation.  Continuity  was  measured 
to  ensure  all  lines  were  defect  free.  The  edge  of  the  ribbon  will  be  pressed  against  the  substrate 
contacts  using  a  pressure  plate  with  a  rubber  gasket.  The  other  end  of  the  ribbon  cable  had  a 
commercially  available  high  density  connector  that  connected  to  the  drive  electronics. 




The  cable  to  the  drive  electronics  was  wired  in  parallel  with  the  probe  card  connector  so  the 
mounting  fixture  or  the  probe  station  could  be  used  without  having  to  interchange  coax  cables. 
The  mounting  fixture  was  tested  with  a  continuity  check  to  make  sure  all  the  signals  were  getting 
through  to  the  CCD  substrate. 

4.1.3  Electronic  Driver  Design 

For the Space-Time Compander to be used as an imager, the output data must be viewed
visually to demonstrate proper operation. A computer program was written to display the
active elements of the superpixel array on the computer screen with the proper aspect ratio and
spacing. The data in selected superpixels could be dynamically updated on the screen as the CCD
was read out. A commercial 24-channel logic analyzer module was used to sample the superpixels
simultaneously and transmit the data over a high speed serial link to the host computer. The update
rate was somewhat lower than the actual CCD frame rate, but the displayed image would still give
the desired results.

A  system  level  design  was  completed  for  driving  the  compander  in  both  the  imager  and  light 
valve  modes.  The  logic  analyzer  module  was  used  to  accept  data  in  the  imager  mode  and  output  it 
over  the  serial  interface.  The  same  signal  lines  were  used  to  input  the  data  in  the  light  valve  mode. 
A  circuit  board  would  be  added  to  the  CCD  clock  timing  generator  to  allow  data  patterns  to  be 
generated  for  stripes  and  squares  of  various  pixel  sizes.  The  patterns  were  static  but  could  be 
changed  through  wiring  options  for  each  superpixel.  A  new  computer  system  was  installed  with  a 
higher  clock  rate  to  speed  up  the  display  process  somewhat,  although  a  faster  system  would 
ultimately  be  wanted  for  real  time  demonstrations. 

4.1.4  Optical  Interfaces 

Limited by the physical presence of microbridges in the 3-D computers and the nature of the
CCD structures, the STC supercells must be physically separated from each other by a fairly
large gap. This prevents us from using a simple magnify/demagnify optical system to bridge
between fine-grain images and the coarse-grain processors. The optical system in the STC must be
capable of grouping supercells from continuous fine-grain images and matching those supercells to
isolated coarse-grain processors. This function may be achieved with appropriate lens arrays, which
group the supercells and physically separate them from a continuous image; the concept is
illustrated in Figure 4-4. Each lens in the array images a portion of the fine-grain
image onto the corresponding CCD sub-array. The design constraint is to prevent multiple images
from neighboring lenses from overlapping on a single CCD imager. On the other hand, the images
are allowed to overlap in the area between the CCD arrays. Thus, a practical match can be
achieved by balancing the image magnification ratio (MR) and the coverage ratio (CR). Here the
coverage ratio is defined as the ratio of the size of the input image "viewed" by a single lens to the
size of the supercell. The lens design must follow the constraint:

$$\mathrm{CR} < \frac{2 - \mathrm{MR}}{\mathrm{MR}} \qquad (4\text{-}1)$$

where the magnification ratio is less than one, since demagnification is required for the operation.
Once the lens array satisfies the imager requirement, it automatically meets the need for the
modulator function, because there is no information in the area between adjacent CCD arrays.
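
The lens constraint (4-1) can be evaluated numerically for a candidate design. The sketch below is a minimal illustration under stated assumptions: the function names are ours, and the example magnification ratio (a 240 µm CCD array, i.e. 8 pixels at 30 µm, inside a 448 µm supercell) is a hypothetical choice made here, not a value given for the lens design.

```python
# Evaluate the lens design constraint CR < (2 - MR) / MR of Eq. (4-1).
# MR is the magnification ratio (0 < MR < 1, i.e. demagnification) and
# CR is the coverage ratio (input area viewed by one lens / supercell size).

def max_coverage_ratio(mr):
    """Upper bound on the coverage ratio allowed by Eq. (4-1)."""
    if not 0.0 < mr < 1.0:
        raise ValueError("magnification ratio must lie between 0 and 1")
    return (2.0 - mr) / mr

def satisfies_constraint(cr, mr):
    """True if the (CR, MR) pair obeys the strict inequality of Eq. (4-1)."""
    return cr < max_coverage_ratio(mr)

# Hypothetical example: demagnify one 448 um supercell onto a 240 um CCD array.
mr_example = 240.0 / 448.0
print(f"MR = {mr_example:.3f}, CR must be below {max_coverage_ratio(mr_example):.2f}")
```

One way to read the bound: a smaller MR shrinks each lens image, so each lens may view a wider portion of the input before its image spills onto a neighboring CCD array.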

Figure 4-4. The concept of using a microlens array for mapping optical images to physically
separated CCD arrays. The ideal case is to demagnify the necessary area in the optical image to the
CCD array. Image overlapping is allowed if the lens imaging constraint is followed.


4.2  CCD  DESIGN  FOR  STC  APPLICATION 

The design of the CCD Compander circuit on the input (front) side of a wafer involves the
analysis of the normal CCD device and also the interaction of the CCD with the liquid crystal
devices on the output (back) side of the wafer. Initial design efforts for the CCD circuit on the STC
were concentrated on defining the specifications of the circuit. The physical constraints imposed by
the backside spatial light modulating (SLM) layers and the compatibility with the 3-D Computer
were considered. These specifications and considerations were used as the basis in establishing the
CCD process supported by our potential CCD foundry (Orbit Semiconductor). The design issues
span three inter-related levels: circuit design, device design and process design. This section
describes the pertinent design parameters in each level.

4.2.1  CCD  Design  Analysis 

The process and device designs are usually driven by circuit level specifications. The CCD
circuit has two physical requirements: it should contain a 16x16 array of identical superpixels and
each superpixel should be equal in size to a node on the parallel processor array. The existing 3-D
Computer design has a node size of 448 µm x 448 µm. Each superpixel is further required to
consist of an 8x8 array of serial-parallel-serial (SPS) CCD pixels, one I/O pad and the necessary
input and output circuitry as shown in Figure 4-5. Given these constraints, the CCD pixel size
should be maximized to allow maximal charge transfer between the CCD and the liquid crystal
devices while keeping crosstalk among pixels minimal. A pixel size of 30 µm x 30 µm is chosen
with a 22 µm wide channel and 8 µm spacing. The next important circuit parameter is the number
of clock phases of the CCD array. It must satisfy the layout constraint imposed by the above pixel
size, and the signal integrity requirement which is determined by the operational frequency and the
charge transfer efficiency. The operational frequency is targeted at 1.25 MHz for analog signals,
assuming an 8-bit D-to-A conversion from the 3-D Computer, which runs at 10 MHz. To
safeguard against process and design uncertainties, two clock schemes (two-phase and
three-phase) for the serial-CCD circuit were included on the wafers. The two-phase structure is
desirable because it requires one less clock driver and the clock line width is slightly larger.
However, it requires an extra implant and is more critical regarding clock drive parameters.
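
The circuit-level numbers above follow from simple arithmetic, sketched below. Two points are assumptions introduced here rather than statements from the design: that the 1.25 MHz target corresponds to one 8-bit sample per eight 10 MHz processor cycles, and that one CCD pixel is shifted per CCD clock when estimating the time to fill a superpixel.

```python
# Arithmetic behind the circuit-level specifications of the STC superpixel.

PROCESSOR_CLOCK_HZ = 10e6      # 3-D Computer clock rate (from the text)
BITS_PER_SAMPLE = 8            # 8-bit D-to-A conversion (from the text)

# Assumed reading: one analog sample per 8 processor clock cycles.
ccd_clock_hz = PROCESSOR_CLOCK_HZ / BITS_PER_SAMPLE

PIXELS_PER_SIDE = 8            # 8x8 SPS CCD array per superpixel
CHANNEL_WIDTH_UM = 22          # CCD channel width
CHANNEL_SPACING_UM = 8         # spacing between channels
NODE_SIZE_UM = 448             # 3-D Computer node (superpixel) size

pixel_pitch_um = CHANNEL_WIDTH_UM + CHANNEL_SPACING_UM   # 30 um pixel
array_width_um = PIXELS_PER_SIDE * pixel_pitch_um        # CCD array extent
remaining_um = NODE_SIZE_UM - array_width_um             # left for pad and I/O

# Assumed: one pixel shifted per CCD clock, so 64 clocks fill a superpixel.
superpixel_load_time_s = PIXELS_PER_SIDE ** 2 / ccd_clock_hz

print(f"CCD clock          : {ccd_clock_hz / 1e6:.2f} MHz")
print(f"CCD array width    : {array_width_um} um of {NODE_SIZE_UM} um")
print(f"superpixel load    : {superpixel_load_time_s * 1e6:.1f} us")
```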

In  addition  to  the  CCD  array  pixels,  we  have  to  design  the  input  circuit  and  the  output 
amplifier.  We  have  decided  to  use  the  charge  presetting  (Tompsett  scheme)  input  circuit  for  its  high 
linearity  and  low  noise  characteristics.  For  the  output  amplifier,  we  have  picked  the  floating 
diffusion  scheme.  The  detailed  design  of  the  amplifier  requires  more  detailed  simulation  which  will 
be  discussed  later. 

The primary device design parameter is the charge transfer efficiency. It is a function of,
among other parameters, the clock voltage and whether the CCD is a surface or buried channel
device. For process simplicity, we are inclined towards a surface channel device, but this decision
requires confirmation from analysis of the charge transfer efficiency and will be discussed in the
next section. Another important device parameter is the gate voltage/substrate doping required to
deplete the substrate fully (from front to back side) for the imager mode of operation. The objective
is to deplete the substrate with as low a voltage as possible within the constraints of the wafer
thickness and substrate doping concentration. From a mechanical standpoint, we have succeeded in
processing wafers with thickness not smaller than 125 µm and this will be our thickness target. We
can also consistently obtain p⁻-type wafers with doping level not higher than 5x10^12/cm^3. Given
these targets, the expected full depletion voltage is calculated to be about 60 V. However, this
substrate doping level would be too low for the CCD circuit operation and an epi layer of higher
doping level is required. To facilitate deep through-wafer depletion, the 5 µm epilayer generally
used on the p⁻-Si substrate for fabricating the CCD gate structures is replaced with a thin (about
0.2 µm) p-type sheet implant layer. In the output amplifier section, the MOS transistor also needs
characterization to avoid source-drain punch-through. This affects the operating voltage as well as
the minimum gate length (hence speed) of the output device. Other device parameters that need to
be characterized include the threshold body factor and the junction capacitance, which impact the
circuit operation.
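
The quoted full depletion voltage can be checked with the standard one-dimensional abrupt-depletion estimate V = q N W^2 / (2 ε_Si). The sketch below is only a consistency check under that textbook approximation: it ignores the gate dielectric, the surface implant and any built-in potential, and the constant values are common handbook figures rather than numbers from this report.

```python
# 1-D estimate of the voltage needed to deplete a uniformly doped substrate
# of thickness W: V = q * N * W^2 / (2 * eps_si).
# Assumptions: abrupt depletion approximation, uniform doping, no oxide drop.

Q_ELECTRON = 1.602e-19      # elementary charge, C
EPS_0 = 8.854e-12           # vacuum permittivity, F/m
EPS_R_SILICON = 11.7        # relative permittivity of silicon

def full_depletion_voltage(doping_per_cm3, thickness_um):
    """Return the estimated full depletion voltage in volts."""
    doping_per_m3 = doping_per_cm3 * 1e6
    thickness_m = thickness_um * 1e-6
    return (Q_ELECTRON * doping_per_m3 * thickness_m ** 2
            / (2.0 * EPS_R_SILICON * EPS_0))

# Targets from the text: 5x10^12 /cm^3 doping, 125 um wafer thickness.
v_deplete = full_depletion_voltage(5e12, 125.0)
print(f"estimated full depletion voltage: {v_deplete:.1f} V")
```

With the thickness and doping targets given above, this estimate comes out at roughly 60 V, in line with the value quoted in the text. Because the voltage scales with W^2, wafer thickness is the more sensitive of the two parameters.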


Figure  4-5.  Floorplan  of  one  superpixel  of  STC  indicating  direction  of  charge  flow. 

In the process design area, the number of poly (CCD electrode) layers will be determined by
the clock phase requirement of the CCD circuit. A three-phase clock requires three poly layers while
a two- or four-phase clock requires only two poly layers. For a two-phase clock an additional
barrier implant is needed, though it can be the same as the threshold adjust implant. Another
important process parameter is the gate dielectric thickness, which has to support the maximum
gate voltage at full depletion. The epi layer and the threshold adjust implant are the two major
adjustable processing steps that have direct impact on the device and circuit characteristics
described above. Finally, the source-drain dopant species (arsenic versus phosphorus) determines
the junction depth and affects the punch-through voltage.

4.2.2  CCD  Design  Analysis  Tools

In  order  to  quantify  the  interaction  of  the  above  mentioned  circuit,  device  and  process 
parameters,  both  analytical  and  numerical  tools  have  been  employed.  For  analytical  tools,  we  have 
developed  two  spreadsheet  (Excel)  programs.  The  first  program  calculates  the  doping  profile  of  the 
epi  layer  and  the  threshold  adjust  implant  as  a  function  of  the  process  parameters  such  as  implant 
dose  and  high  temperature  processing  steps.  It  also  explores  the  approximate  replacement  of  the  epi 
layer  by  a  well  implant  and  a  long  drive-in  step  which  we  shall  also  pursue  experimentally.  This 
program  approximates  the  implanted  profile  as  a  Gaussian  profile  and  calculates  the  spreading  of 
the  dopants  due  to  thermal  diffusion  while  taking  into  account  the  segregation  of  the  dopant 
(boron)  at  the  oxide  interface.  The  final  profile  is  then  fitted  with  a  Gaussian  profile  which  can  then 
be  used  in  commercial  2-D  numerical  device  simulation  programs.  A  sample  of  the  spreadsheet 
printout  together  with  plots  of  the  epi  layer,  the  replacement  well  implanted  profile  and  the 
threshold  implanted  profile  are  shown  in  Figure  4-6. 
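
The core of the first spreadsheet calculation, a Gaussian implant profile broadened by thermal diffusion, can be sketched as follows. This is a simplified illustration, not a reproduction of the spreadsheet: it omits the boron segregation at the oxide interface that the program accounts for, and the dose, range, straggle, diffusivity and time used in the example are hypothetical placeholder values.

```python
import math

# Gaussian approximation of an implanted dopant profile and its broadening
# by thermal diffusion: sigma^2 = straggle^2 + 2 * D * t.
# Simplification: no surface segregation or reflection is modeled here.

def diffused_sigma_um(straggle_um, diffusivity_cm2_per_s, time_s):
    """Standard deviation of the profile (um) after a drive-in step."""
    dt_um2 = diffusivity_cm2_per_s * time_s * 1e8   # convert cm^2 to um^2
    return math.sqrt(straggle_um ** 2 + 2.0 * dt_um2)

def gaussian_profile(depth_um, dose_per_cm2, range_um, sigma_um):
    """Dopant concentration (per cm^3) at a given depth."""
    sigma_cm = sigma_um * 1e-4
    peak = dose_per_cm2 / (math.sqrt(2.0 * math.pi) * sigma_cm)
    return peak * math.exp(-((depth_um - range_um) ** 2) / (2.0 * sigma_um ** 2))

# Hypothetical example values (not taken from the report).
DOSE = 1e12            # implant dose, per cm^2
RANGE_UM = 0.30        # projected range
STRAGGLE_UM = 0.07     # as-implanted straggle
sigma_after = diffused_sigma_um(STRAGGLE_UM, 1e-13, 3600.0)

for depth in (0.0, 0.3, 0.6, 1.0):
    n = gaussian_profile(depth, DOSE, RANGE_UM, sigma_after)
    print(f"depth {depth:.1f} um: {n:.3e} per cm^3")
```

Since the broadened profile is again Gaussian, it can be passed to a device simulator as a peak concentration, a range and a standard deviation, which is the role the fitted profile plays in the text.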

The second spreadsheet program calculates the charge transfer efficiency as a function of the
number of clock phases, voltage and other process, device and circuit parameters, both for the
Horizontal (serial) and the Vertical (parallel) CCD devices. The program takes into account the
self-induced field, the fringing field, the diffusion and the surface-state components, and combines
them in an approximate way to arrive at an effective transfer efficiency. This program also calculates the
1-D  full  depletion  voltage  under  the  CCD  gate,  the  threshold  voltage  body  factor  and  the  junction 
capacitance  coefficients  of  the  MOS  transistor  in  the  output  amplifier  circuitry.  The  last  two  sets  of 
parameters  will  be  used  as  device  parameter  inputs  to  commercial  circuit  simulation  programs.  A 
sample  of  the  spreadsheet  printout  is  shown  in  Figure  4-7. 

Two numerical simulation programs have been used for the CCD design. On the device level,
the 2-D device simulation program PISCES was used for source-drain punch-through simulation.
Given a MOS transistor structure and the applied terminal voltages, the program solves the 2-D
Poisson and continuity equations for the electrical potential and the carrier concentrations. A
sample  potential  profile  is  shown  in  Figure  4-8  showing  the  separation  of  the  depletion  edges 
between  the  source  and  the  drain  of  the  transistor.  This  simulation  is  performed  with  a  drain 
voltage  of  16  V,  giving  enough  margin  for  a  12  V  operation.  On  the  circuit  level,  circuit  simulator 
MacSPICE  was  used  to  simulate  the  CCD  output  amplifier  and  to  guide  the  layout  design  of  the 
various  transistors.  Because  of  the  highly  non-uniform  substrate  doping  seen  by  the  transistors, 
two  device  models  were  used  to  represent  the  transistors  that  operate  in  the  respective  substrate 
voltage  ranges.  The  output  amplifier  circuit  schematic  and  its  voltage  and  current  waveforms  are 
shown  in  Figure  4-9. 

Figure 4-6 (a). Spreadsheet program and plots for the epi layer, the replacement well
implanted profile and the threshold implanted profile.

Figure 4-7. Spreadsheet program for calculating the charge transfer efficiency, the full
depletion voltage under the CCD gate, the threshold voltage body factor and the junction
capacitance coefficients of the MOS transistor in the output amplifier.

[Figure 4-8 plot: "CCD punch-through - Potential"; source, gate and drain marked;
horizontal axis: Distance (Microns)]

Figure 4-8. Cross-sectional 2-D potential profile of the output MOS transistor showing the
separation of the depletion edges from the source and the drain, as generated by punch-through
simulation using PISCES.

Figure 4-9. Output amplifier circuit schematic, device models and the voltage and current
waveforms as generated by circuit simulation using SPICE.

4.2.3  CCD  Design  Analysis  Results

Based on the above design analysis, we completed the CCD circuit design and, together with
the circuit foundry, established the process necessary for its fabrication. The basic design
parameters are summarized below:

1. Circuit:

• Surface channel CCD
• Two-phase or three-phase clock for H-CCD, and three-phase clock for V-CCD
• Charge presetting input scheme
• Floating diffusion output amplifier with one source-follower stage

2. Device:

• 12 V operating voltage except for the RESET and OUTEN control lines, which use 18 V
• Full depletion voltage about 64 V
• Minimum gate length of output transistor is 6 µm
• Threshold voltage of output transistors is 2 V with body factor ranging from 1.9 to 3.0 V^(1/2)
• Zero-bias junction capacitance is 1.2x10^-4 F/m^2 with voltage coefficient = 1.67-2

3. Process:

• 2-µm-thick, 1x10^15 per cm^3 p-epi layer on 5x10^12 per cm^3 p⁻-substrate
• Triple poly
• Single metal
• 950-Å-thick oxide + 750-Å-thick nitride gate dielectric
• Threshold adjust (also serves as barrier) implant parameters: 9x10^11/cm^2 boron at 100 keV
• Arsenic source-drain implanted junction, 0.4 µm deep

An overall charge transfer efficiency of about 94% is expected from this design when operating at
1.25 MHz or slower.
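
An overall transfer efficiency is the product of the efficiencies of the individual transfers, so for N identical transfers it equals the per-transfer efficiency raised to the power N. The sketch below illustrates that relationship only. The transfer count used in the example is a hypothetical placeholder: the number of transfers behind the 94% figure is not stated here, and the spreadsheet's actual combination of loss mechanisms is more detailed.

```python
# Relationship between per-transfer and overall charge transfer efficiency
# for a chain of N identical transfers: overall = per_transfer ** N.

def overall_efficiency(per_transfer, n_transfers):
    """Overall efficiency after n_transfers identical transfers."""
    return per_transfer ** n_transfers

def required_per_transfer(overall, n_transfers):
    """Per-transfer efficiency needed to reach a given overall efficiency."""
    return overall ** (1.0 / n_transfers)

# Hypothetical transfer count for illustration only (not from the report).
N_TRANSFERS = 72
TARGET_OVERALL = 0.94          # overall efficiency quoted in the text

per_transfer = required_per_transfer(TARGET_OVERALL, N_TRANSFERS)
print(f"per-transfer efficiency needed: {per_transfer:.5f}")
print(f"check: {overall_efficiency(per_transfer, N_TRANSFERS):.3f}")
```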

Overall we had eight different versions of the circuit chips incorporated in the floorplan of the
whole wafer and one of each was available for pre-thinning characterization. In each of the circuit
chips, besides the 16x16 superpixels, there were test circuits that could be used to characterize line
resolution and transfer efficiency of the front-back charge transfer process. There were also test
circuits for testing the operation of an individual superpixel, a shorter serial-CCD chain and a
shorter parallel-CCD chain (the product had an 8x8 CCD array). Figure 4-10 shows the floorplan of
a typical Compander chip with test cells populating the area to the right of and below the 16x16
superpixel array. Figure 4-11 shows the layout of the superpixel containing an 8x8 CCD array.




Section  5 

EOCA  SYSTEM  ANALYSIS


For the purpose of EOCA system analysis we have developed switching energy requirements
and delay models for RC-limited lines in the 3-D computer, as well as simplified models for
Si-CMOS light receivers and integrated lasers. We developed models for comparing the performance
of electronic and optical interconnects. Results (see Subsection 5.1) show significant advantages
for free-space optical interconnects versus RC-limited lines in the 3-D computer environment. In
the general area of the EOCA, we also optimized the architecture of the one-stage shuffle
interconnection network for permutation traffic in the 3-D computer (see Subsection 5.2) and
developed the concept of the time-dilated network for low blocking communications. The routing,
sorting and FFT operations on the EOCA system are analyzed and presented in
Subsection 5.3.

5.1  ELECTRONIC  VERSUS  OPTICAL  INTERCONNECTS  IN  THE  3-D 
COMPUTER 

An important study in this program is to identify when and how free-space optical interconnects should be used in the 3-D computer. It has been shown that, architecturally, it is critical to provide global interconnections to the 3-D computer to speed up the machine, in particular for image processing applications. In this section we first outline the limitations of 3-D electronic systems compared to similar 3-D stacked systems interconnected using free-space optoelectronic technology. We then show that, from a power/throughput point of view, free-space optoelectronic technology is a good candidate for implementing such global connections.

5.1.1  Architecture  Consideration 

It is instructive to compare the potential performance of the Hughes 3-D VLSI computer with its electronic counterparts as well as with the potential of the optically augmented 3-D computer. Here, computational tasks for low-level and higher-level image processing applications are considered. These tasks can be classified either by their computational complexity or by the globality of information transfer required by the computation. Simple operations involving multiple images, such as image addition and subtraction, require only point-wise operations between images. Point or “histogram” based operations manipulate the value of a pixel based on the values of some or all of the other pixels in the image, but do not require knowledge of specific pixel locations. Window or “convolution” based operations apply kernels of arbitrary sizes so that pixel values can be modified by a function of the other pixels within the window. For compute-bound operations, such as matrix inversion or window-based convolution, erosion, dilation, etc., the 3-D computer provides fast processing by virtue of its single-instruction multiple-data parallel processing capability: window operations are reduced from $O(N^2K^2)$ time for a serial processor to $O(K^2)$ time for the 3-D computer for a kernel of size KxK pixels; likewise, the time complexity of matrix inversion is reduced from $O(N^3)$ to $O(N)$.(5-1)

In cases where the computational complexity is bound by the communication capability of the machine, the use of an optoelectronic multistage interconnection network (MIN) within the 3-D computer can significantly reduce processing times. In Table (5-1), processing times for the optically augmented 3-D machine are compared to those of the 3-D computer and a serial electronic approach for communication-bound operations. For applications such as routing and FFT, which are naturally suited to a Butterfly network structure, the optoelectronic system provides significant speedup. For certain point operations, data sorting may be required. This is possible in O(Log2N) time using a MIN. Finally, KxK kernel generation can be performed in parallel, assuming additional simple functionality can be incorporated into the switching elements of the multistage interconnection network.(5-2)


TABLE 5-1. Performance comparison of the regular and augmented 3-D VLSI computer for image processing applications (NxN image, KxK window).

| Operation | Serial Electronic Processor (# processors = 1) | 3-D VLSI Computer (# processors = P = N) | Optically Augmented 3-D VLSI Computer (# processors = P = $N^2$) |
|---|---|---|---|
| Point Operations (e.g., Histogram Calc.) | $O(N^2)$ | $O(N)$ | $O(\log_2^2 N)$ |
| Fourier Transform (DFT/FFT) | $O(N^2 \log_2 N)$ | $O(N)$ | $O(\log_2 N)$ |
| Routing (e.g., matrix transpose) | $O(N^2)$ | $O(N)$ | $O(\log_2 N)$ |
| KxK Kernel Generation | $O(K^2)$ | $O(K^2)$ | $O(\log_2^2 K)$ |
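
To give a feel for the orders of growth listed in Table 5-1, the following sketch evaluates each entry with all constant factors dropped. The function name and the example sizes (N = 1024, K = 8) are illustrative assumptions, not values taken from the analysis, and the numbers are growth-rate indicators rather than predicted cycle counts.

```python
import math


def table_5_1_steps(n: int, k: int) -> dict:
    """Evaluate the order-of-growth expressions of Table 5-1 (constants dropped).

    Each value is a tuple: (serial processor, 3-D computer, optically augmented).
    """
    log_n = math.log2(n)
    log_k = math.log2(k)
    return {
        "point operations": (n**2, n, log_n**2),
        "fourier transform": (n**2 * log_n, n, log_n),
        "routing": (n**2, n, log_n),
        "kxk kernel generation": (k**2, k**2, log_k**2),
    }


if __name__ == "__main__":
    # Illustrative sizes only: a 1024x1024 image and an 8x8 window.
    for op, (serial, three_d, augmented) in table_5_1_steps(1024, 8).items():
        print(f"{op:24s} serial={serial:.3g}  3-D={three_d:.3g}  augmented={augmented:.3g}")
```

Because only growth rates are compared, the ratios between columns matter more than the absolute values.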

5.1.2  Technology  Consideration 

As illustrated in Fig. (5-1), there are two ways of implementing global connections within the 3-D VLSI computer: one using free-space optoelectronics, the other using a dedicated interconnection layer. We assume that the interconnections on the dedicated layer are RC-limited lines, since their lengths and operating frequency do not place them in the transmission-line regime. The electrical model we use for the electrical interconnections, from the source on the lower layer to the destination on the upper layer through an Al line on the interconnection layer, is shown in Fig. 5-2(a). Note that throughout this section it is assumed that data is coded on the links in a Non-Return-to-Zero fashion. In addition, to simplify the derivations, a 50% duty cycle is also assumed.


Figure 5-1. Electronic vs. optoelectronic global connections in a 3-D computer system. (Panels: optical implementation; electrical implementation.)


From source to destination, the electrical model contains: a driver D with an output capacitance $C_{dout}$, a switch ($C_S$, $R_S$), a bridge ($C_{br}$) and via ($C_{via}$, $R_{via}$, $I_{leak}$) pair, the aluminum line on the interconnection layer ($R_{int}$, $C_{int}$), another bridge/via pair, and a CMOS receiver R with an input capacitance $C_{rin}$. The switches provide bi-directionality to the connection in order to preserve the functionality of the 3-D computer. In this model we use a standard superbuffer (multistage) design for the source driver.(5-3) It can be shown that the required number of stages ($n$) to drive the connection and the delay ($T_d$) introduced by the driver are:

$$ n = \frac{\ln(C_{load}/C_0)}{\ln \beta}, \qquad T_d = n\,[1 + (\beta - 1)\,p]\,R_0 C_0 \tag{5-1} $$

We use standard 1.2 µm CMOS technology parameters, which leads to $\beta = 5$, $C_0$ = 23 fF, $R_0$ = 8.7 kΩ, $C_{load}$ = 3 pF, and $p = 0.25$. This yields $n = 3$ and $T_d$ = 1.3 nsec. The vertical delay for the via/bridge pair can be calculated as:(5-4)


$$ T_{VB} = 0.7R_S\left[C_S + C_{rin} + C_{dout} + C_{BR} + C_{VIA} + 2C_S + C_{rpin}\right] + 0.4R_{VIA}C_{VIA} + 0.1R_{VIA}\left[2C_S + C_{rpin}\right] + 0.7R_S\left[C_S + C_{rpin}\right] \tag{5-2} $$


with $R_{via}$ = 20 Ω (resistance of a via), $C_{br} + C_{via}$ = 1.5 pF (capacitance of a bridge plus a via), and $C_{rpin}$ = 0.86 pF (input capacitance of a repeater). $C_{rin}$ is the input capacitance of the CMOS receiver and $C_{dout}$ is the output capacitance of the driver. Note that the switches are chosen to be as large as the largest inverter, which leads to $R_S$ = 100 Ω and $C_S$ = 60 fF. By combining Eq. (5-1) and Eq. (5-2) and using the values for 1.2 µm technology, it can be shown that the total vertical propagation delay ($T_V$) from the driver to the interconnection layer is:

$$ T_V = 1.7\ \text{nsec} \tag{5-3} $$

Figure 5-2. (a) Physical and circuit-equivalent model of an electrical connection in the 3-D VLSI computer (labels: via, bridge). (b) Simplified circuit-equivalent model of an optoelectronic interconnection.


Next, we calculate the delay of the Al interconnection line on the interconnection layer. To minimize the delay on this line, it is necessary to use repeaters that are optimally designed in terms of size and spacing. For 1.2 µm technology, it can be shown that for a line of length $L_{int}$ (in cm), the optimal number of repeaters ($k$), the optimal repeater size ($h$, in terms of minimum gate size), and the delay ($T_{IW}$, in nsec) are:(5-4)

$$ k = \frac{L_{int}}{1.67}, \qquad h = 77, \qquad T_{IW} = 0.4\,L_{int} \tag{5-4} $$


Since the line segments between repeaters (1.67 cm) have parasitic characteristics similar to those of the bridge/via pair, the last repeater on the interconnection layer is capable of driving the vertical connection through the via/bridge to the CMOS receiver at the destination. The delay ($T_L$) of this last part of the interconnection can be calculated from Eq. (5-4) with $L_{int}$ = 1.67 cm:

$$ T_L = 0.67\ \text{nsec} \tag{5-5} $$


Therefore, the total delay from source to destination ($T_{sd}$), as a function of the interconnection line length on the interconnection layer and given in nsec, is simply found by adding Eqs. (5-3), (5-4), and (5-5):

$$ T_{sd} = T_V + T_L + T_{IW} = 2.4 + 0.4\,L_{int} \tag{5-6} $$
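
The delay model above can be checked numerically with a short sketch. It uses only the parameter values quoted in the text (23 fF, 3 pF, a stage ratio of 5, a repeater spacing of 1.67 cm, and the linear total-delay expression); the function names are illustrative, and rounding the stage count to the nearest integer is an assumption made here for convenience.

```python
import math

# Parameter values quoted in the text for 1.2 um CMOS.
C0_F = 23e-15               # capacitance of a minimum-size inverter
C_LOAD_F = 3e-12            # load capacitance driven by the superbuffer
STAGE_RATIO = 5             # size ratio between successive superbuffer stages
REPEATER_SPACING_CM = 1.67  # optimal spacing between repeaters


def superbuffer_stages(c_load=C_LOAD_F, c0=C0_F, ratio=STAGE_RATIO):
    """Stage count ln(C_load / C0) / ln(ratio), rounded to the nearest integer."""
    return round(math.log(c_load / c0) / math.log(ratio))


def repeater_count(l_int_cm):
    """Optimal number of repeaters for a line of l_int_cm centimetres (unrounded)."""
    return l_int_cm / REPEATER_SPACING_CM


def source_to_destination_delay_ns(l_int_cm):
    """Total delay in nsec: 2.4 nsec fixed part plus 0.4 nsec per cm of line."""
    return 2.4 + 0.4 * l_int_cm


if __name__ == "__main__":
    print("superbuffer stages:", superbuffer_stages())
    for length_cm in (1.0, 5.0, 10.0):
        delay = source_to_destination_delay_ns(length_cm)
        print(f"{length_cm:5.1f} cm -> {delay:.2f} nsec")
```

The fixed 2.4 nsec part is the sum of the vertical delay and the delay of the last repeater segment, so only the second term depends on the line length.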


The delay due to the interconnection then limits the maximum clock frequency at which a connection of a given length can be driven. Two cases are envisaged: one where the data is sent straight from the source to the destination, and one where the data is latched on the interconnection layer and then resent.

The total equivalent driver size ($H$) of the interconnect from source to destination, where $H$ is expressed relative to the size of a minimum-size inverter, is then:

$$ H = 31 + 77\left(\frac{L_{int}}{1.67} + 1\right) \tag{5-7} $$

which corresponds to three inverters in the superbuffer of sizes 1, 5, and 25 and $L_{int}/1.67$ repeaters of size 77.

The other metric of interest for this analysis is the energy ($E_{elec}$) required to drive the connection from source to destination as a function of the line length on the interconnection layer. It can be shown that the total switching energy required per bit is simply the sum of the DC energy due to the junction leakage in the vias and the switching energy, defined as the required power times the rise time:

$$ E_{elec} = P\,t_r + 2\,I_{leak} V_{dd} T \tag{5-8} $$

where $P = H\,V_{dd}^2/R_n$, $t_r$ is the rise time (assumed equal to the fall time) of a switching transition, $1/T$ is the operating frequency of the system, $V_{dd}$ is the supply voltage, and $I_{leak}$ is the leakage current of the vias. The interesting result from the delay and switching energy requirement models is that both increase linearly with the interconnection length ($L_{int}$) on the interconnection layer.

We now derive the equivalent energy model for an optical link ($E_{opt}$) in order to compare electronic and optoelectronic technologies for interconnections in the 3-D computer environment. Fig. (5-2b) shows the simplified model that we used for an optical link. We assume a VCSEL as the light transmitter, with a threshold current ($I_{th}$ = 1 mA), a bias voltage $V_{DD}$, a laser current ($I_{las}$), and a conversion efficiency ($\eta_L$). We also assume a p-n junction diode receiver ($R_{load}$, $C_{det}$) with a quantum efficiency ($\eta_D$ = 0.3) that generates a photocurrent ($I_{det}$), followed by an amplifier circuit of gain $A$. The free-space link between the transmitter and the receiver is modeled by a simple optical transmission efficiency ($\eta_O$).

Assuming a supply voltage $V_{DD}$ = 5 V identical for both transmitter and receiver circuits, an operating frequency of $1/T$ (with a rise time $t_r$ and a fall time $t_f$), and a standard detector capacitance ($C_{det}$ = 110 fF for a standard Si PIN diode in 1.2 µm technology), the energy requirement for the optoelectronic link is simply:

$$ E_{opt} = V_{DD}\,\frac{T}{2}\,\left(I_{las} + I_{det} + I_{th}\right) \tag{5-9} $$


In this equation the DC power consumption in the CMOS receiver is neglected, as it will always be much less than the other terms. Also note that $I_{det}$ only becomes significant at very high speeds. Considering the speed requirement of the detector ($I_{det} = C_{det} V_{DD}/(A\,t_r)$) and the relation between the required laser current and the detector current ($I_{las} = I_{det}/\eta$), Eq. (5-9) reduces to:

$$ E_{opt} = \alpha\,\frac{C_{det} V_{DD}^2}{2A\eta} + \frac{T\,I_{th} V_{DD}}{2} \quad \text{per bit} \tag{5-10} $$

with $\eta = \eta_L \times \eta_O \times \eta_D$ and $\alpha = T/t_r$. Note that the first term of Eq. (5-10) is the energy associated with the photocurrent required for switching the connection, while the second term is the energy required to maintain the VCSEL at the threshold current value. The parameter of most interest in Eq. (5-10) is the product $A\eta$, which represents the overall efficiency of the optoelectronic interconnect (including the detection circuit). It can easily be seen that for a given operating frequency ($1/T$) the required switching energy is inversely proportional to $A\eta$, whereas for small values of $T$ (high operating frequency) the second term of Eq. (5-10) becomes negligible. For the rest of the calculations we assume that $\alpha = 5$ in Eq. (5-10), which means that the rise time ($t_r$) is assumed to be 20% of the operating period (this is usually considered an acceptable value).

Figure (5-3a) plots the switching energy requirements for both technologies. For each line length ($L_{int}$), the operating frequency is taken as the maximum achievable by the electronic line. This makes the second term of Eq. (5-10) negligible, and the optical energy becomes essentially constant, independent of line length. It can be seen that at the highest achievable channel speed, an electrical connection in the 3-D computer requires 3 to 20 times more switching energy than its optoelectronic counterpart.

In the energy model that we have developed, there are two independent contributions to the total electrical switching energy. One, $E_{ver}$, is due to the vertical connections (source, bridges, vias, destination) and is a constant, independent of connection length. The other, $E_{iw}$, is due to the line on the interconnection layer and increases with line length:

$$ E_{elec} = E_{ver} + E_{iw} \tag{5-11} $$

Therefore, for a given operating frequency, we can find the break-even line length at which it becomes advantageous to use an optical link. From the previous equations it can be shown that if the product $A\eta$ is greater than 0.15, then $E_{ver} > E_{opt}$ and it is always advantageous to use optoelectronics. If $A\eta$ is smaller than 0.15, there is a domain (up to a specific break-even length) where it is advantageous to use an interconnection layer. Note that a gain-efficiency product of 0.15 corresponds, for example, to an optoelectronic system with a wall-plug efficiency of 1.5% and a receiver circuit gain of 10, which is readily achievable today with optoelectronic technologies. In addition, this assumes that a single inverter is used as the amplifier in the receiver circuit, limiting the range of possible values of $A$ to less than 20. If more gain were required in the receiver, more complex designs such as multi-stage transimpedance receiver circuits could be used. However, the power consumed in the receiver would also increase.

Figure 5-3. Energy requirement of optoelectronic vs. electrical interconnects. (a) Energy requirements vs. line length for optoelectronic and electronic interconnections in the 3-D computer. (b) Break-even line length for optoelectronic vs. electronic connections in the 3-D computer.

The results on the break-even line length are shown in Fig. (5-3b). For example, at the present speed of a few tens of MHz in a 3-D computer, it is already more efficient to use free-space optoelectronic interconnects with a gain-efficiency product of only 0.01 (e.g., a wall-plug optical efficiency of 0.1% and a receiver circuit gain of 10) for connections longer than 8 cm. Such wire lengths will exist in a system implemented on 4-inch wafers or equivalent that requires global connections. This is a very encouraging result because today’s optoelectronic technology already allows such gain-efficiency products in free-space links.

5.2  OPTICALLY  AUGMENTED  3-D  VLSI  COMPUTER 

It  has  been  shown  in  the  previous  sections  that  global  interconnections  are  advantageous  for 
many  operations  in  a  parallel  system.  In  addition,  free-space  optoelectronic  technologies  offer  good 
performance  potential  for  implementing  such  connections. 

5.2.1  The  Optical  Transpose  Interconnection  System 

The Optical Transpose Interconnection System (OTIS) is a one-to-one interconnection between $P$ transmitters $\{P_t\}$ and $P$ receivers $\{P_r\}$, where $P_t$ and $P_r$ range from 0 to $P-1$ and $P$ is the product of two integers, $M$ and $N$. Since $P = MN$, the indices $P_t$ and $P_r$ can be divided into ordered pairs $(n_t, m_t)$ and $(m_r, n_r)$ respectively, where $m_t$ and $m_r$ range from 0 to $M-1$ and $n_t$ and $n_r$ range from 0 to $N-1$, such that:

$$ \begin{aligned} P_t &= M n_t + m_t, & P_r &= N m_r + n_r \\ n_t &= \mathrm{trunc}(P_t/M), & m_r &= \mathrm{trunc}(P_r/N) \\ m_t &= P_t \bmod M, & n_r &= P_r \bmod N \end{aligned} \tag{5-12} $$

The indices $n_t$ and $m_r$ are referred to as the major indices of the transmitters and receivers, respectively; $m_t$ and $n_r$ are called the minor indices; and $P_t$ and $P_r$ are referred to as the scalar indices. In the transpose interconnection, $(n_t, m_t)$ is connected to $(m_r, n_r)$ if and only if $m_t = m_r$ and $n_t = n_r$. Such an interconnection is called an MxN transpose.
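
The index bookkeeping of the MxN transpose can be illustrated with a few lines of code. This is a minimal sketch of the mapping as defined above; the function name `otis_receiver` and the example sizes M = 2, N = 3 are illustrative and do not come from the report.

```python
def otis_receiver(p_t: int, m: int, n: int) -> int:
    """Scalar receiver index reached by transmitter p_t in an M x N transpose."""
    if not 0 <= p_t < m * n:
        raise ValueError("transmitter index out of range")
    n_t, m_t = divmod(p_t, m)  # major index n_t = trunc(p_t / M), minor index m_t = p_t mod M
    m_r, n_r = m_t, n_t        # transpose rule: m_r = m_t and n_r = n_t
    return n * m_r + n_r       # receiver scalar index P_r = N * m_r + n_r


if __name__ == "__main__":
    M, N = 2, 3
    mapping = [otis_receiver(p, M, N) for p in range(M * N)]
    print(mapping)
```

Every receiver index appears exactly once in the resulting list, which reflects the one-to-one nature of the interconnection.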

An important application of OTIS is in support of multistage interconnection network architectures (OTIS-MIN) based on K-shuffles.(5-5) A K-shuffle MIN functionally provides full connectivity between $P$ input channels (PEs) and $P$ output channels (PEs) in $\log_K P$ stages of optoelectronic switch planes and $(\log_K P) - 1$ stages of optical K-shuffles. One optoelectronic switch has $K$ optical inputs (receivers) and $K$ optical outputs (transmitters) and provides full routing between its $K$ channels. In a K-shuffle, the receiver $P_o$ to which transmitter $P_i$ is connected is given by:

$$ P_o = K\,[P_i \bmod (P/K)] + \mathrm{trunc}\!\left(\frac{P_i}{P/K}\right) \tag{5-13} $$

where trunc(x) is defined as the greatest integer ≤ x. If $M$ is set equal to $P/K$ and $N$ is set equal to $K$, then

$$ P_o = N m_i + n_i \tag{5-14} $$

Equating (5-12) with (5-14) for arbitrary values of $P_i$ implies that $n_o = n_i$ and $m_o = m_i$. Therefore, a K-shuffle is equivalent to a (P/K)xK transpose. An interesting application of the OTIS-MIN arises when $K = (MN)^{1/2}$, since the number of stages for routing becomes a constant:

$$ \log_K P = \log_{\sqrt{MN}} MN = 2 \tag{5-15} $$


In this case only 2 stages of optoelectronic switches and 1 stage of optics are required to perform full routing between source and destination planes, independent of the total number of communication channels ($P$) in the MIN. As described in Fig. (5-4), the optoelectronic system for routing applications requires 2 stages of optoelectronic switching elements and a single stage of optical interconnects. This optical interconnection system consists of two identical planes of diffractive optical elements containing $\sqrt{P}$ lenslets. For use in an OTIS-MIN system, each optoelectronic stage of the network contains $\sqrt{P}$ optoelectronic $\sqrt{P} \times \sqrt{P}$ switches (switches with $\sqrt{P}$ inputs and $\sqrt{P}$ outputs), arranged in a square 2-D array with the modulators and the detectors uniformly distributed in the plane. Note that in a related study the area, power consumption, and speed of such an approach have been analyzed and shown to be near optimal.(5-5)

Figure 5-4. OTIS as a $\sqrt{P}$-shuffle interconnection.
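
The equivalence between a K-shuffle and a (P/K)xK transpose, and the constant two-stage depth obtained when K is the square root of P, can be verified exhaustively for small networks. The sketch below assumes that K divides P and that P is an exact power of K; all function names are illustrative.

```python
def k_shuffle(p_i: int, p: int, k: int) -> int:
    """Receiver index for transmitter p_i in a K-shuffle of P channels (K divides P)."""
    block = p // k                           # P/K
    return k * (p_i % block) + p_i // block


def transpose(p_t: int, m: int, n: int) -> int:
    """Receiver index for transmitter p_t in an M x N transpose."""
    n_t, m_t = divmod(p_t, m)
    return n * m_t + n_t


def switch_stages(p: int, k: int) -> int:
    """Number of switch stages log_K(P), for P an exact power of K (K >= 2)."""
    count, channels = 0, 1
    while channels < p:
        channels *= k
        count += 1
    if channels != p:
        raise ValueError("P must be an exact power of K")
    return count


if __name__ == "__main__":
    P, K = 4096, 64
    same = all(k_shuffle(i, P, K) == transpose(i, P // K, K) for i in range(P))
    print("K-shuffle equals (P/K) x K transpose:", same)
    print("switch stages for K = sqrt(P):", switch_stages(P, K))
```

With K = 2 the same function reproduces the familiar perfect shuffle, which is a convenient sanity check on the index arithmetic.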

5.2.2  Additional  Consideration  on  OTIS 

Another important feature of the Optical Transpose Interconnection System is that it can be made bi-directional. This is crucial in the 3-D VLSI computer environment, since it places no restriction on the vertical through-wafer electronic bus. The implementation of the bi-directional OTIS optical system is shown in Fig. (5-5). Plane 1 and plane 2 are now both transmitter/receiver planes. Plane 2 is rotated 180° with respect to plane 1 so that a transmitter on plane 1 faces a receiver on plane 2 and vice versa. Beam blocks between adjacent lenses are required in order to avoid crosstalk. These beam blocks are located on the lower half of the transmitter lenses and the upper half of the receiver lenses, thereby slightly reducing the light efficiency of the system.

An illustration of the schematic assembly of the OTIS-MIN in the 3-D computer environment is given in Fig. (5-6). In this case, only one power coupling element (usually a polarizing beam splitter) is required for the implementation of a single-stage OTIS. This element also provides the mechanical structural support for the system. The optical data (dashed lines in Fig. (5-6)) is then transmitted from 3-D stack 1 to 3-D stack 2 and back via the bi-directional OTIS optical system. The power for the modulators is routed to the OEICs (solid lines in Fig. (5-6)) through an optical system that uses lenses identical to those of the OTIS interconnection system. It is interesting to note that new technologies such as Birefringent Computer Generated Holograms(5-6) allow dramatic simplifications of the optical system by making it possible to combine the spot array generator and the OTIS lenses in the same element.

Figure 5-6. Principle of the assembly of the optically augmented 3-D VLSI computer.


5.2.3  Time  Dilated  Network 

The metric of importance for evaluating the performance of a network is its bandwidth, which is the expected number of network requests accepted per unit time. Network bandwidth is defined as the product of the system clock speed, the network size ($N$), and the probability that an arbitrary request will be accepted by the network ($P_a$). When two packets are routed to the same output of the 2x2 basic switching element, it is assumed that one randomly chosen packet is dropped. Destination addresses for the packets are generated independently, with uniform probability P. Under these assumptions, it can be shown that the average bandwidth of the 2-D shuffle-exchange network for 2x2 grain, for large values of $N$, is given by:

$$ BW_2 = F\,N\,P_a = F\,N\,\frac{4}{\log_2 N} \tag{5-16} $$


where $F$ is the network clock speed and $P_a = 4/\log_2 N$. It should be noted that the worst-case bandwidth can be as low as $O(\sqrt{N})$, and the worst case includes the important bit-reversal and matrix-transpose permutations. The formula given in Eq. (5-16) is also valid for MINs with KxK grain, since each KxK grain is built using simple 2x2 switching elements.
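
The large-N bandwidth estimate can be evaluated directly. In the sketch below the acceptance probability is capped at one, which is an assumption added here because the asymptotic expression 4/log2(N) exceeds one for very small networks; the function names and the 20 MHz example clock are illustrative.

```python
import math


def acceptance_probability(n_channels: int) -> float:
    """Large-N acceptance probability Pa = 4 / log2(N), capped at 1 (cap is an assumption)."""
    if n_channels < 2:
        raise ValueError("network must have at least 2 channels")
    return min(1.0, 4.0 / math.log2(n_channels))


def bandwidth(clock_hz: float, n_channels: int) -> float:
    """Average accepted requests per second: F * N * Pa."""
    return clock_hz * n_channels * acceptance_probability(n_channels)


if __name__ == "__main__":
    clock_hz = 20e6  # illustrative clock rate
    for n in (256, 1024, 4096):
        pa = acceptance_probability(n)
        print(f"N = {n:5d}: Pa = {pa:.3f}, bandwidth = {bandwidth(clock_hz, n):.3e} requests/s")
```

The acceptance probability falls only logarithmically with network size, so the aggregate bandwidth still grows almost linearly with N under random traffic.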

Figure (5-7a) shows the analytical acceptance rate of the network as a function of network size. The dashed line on the plot shows the standard reference point for network performance, which is that of a crossbar. The other lines show the performance of our network under various conditions; the line marked 1x shows the acceptance rate for the standard network. In the case of the optically augmented 3-D computer with a one-stage OTIS-based shuffle network, Fig. (5-7a) shows that the time-dilated network concept(5-7) offers great potential for lowering the blocking probability of the system. This concept is very simple: it assumes that the network can be operated faster than the processors attached to the network, and that blocked data packets are buffered at the input of the network and then resent. Here it is a practical solution, since the clock frequency of the processors is only a few tens of MHz and the data packet size is only 16 bits. Note that the performance shown in Fig. 5-7(a) is for random traffic only. For permutation traffic, the acceptance rate of a crossbar is one, and the acceptance rate of our time-dilated network is expected to approach one. To evaluate the performance of the time-dilated network under permutation traffic conditions, we performed simulations, the results of which are given in Fig. 5-7(b). These results are based on the average of 1,000 permutations. They show that for a network with 4096 channels only 10 rounds are required for routing the complete permutation. This means that a 10x time-dilated network will route a random permutation with probability one. However, it should be noted that there will always be a worst-case routing pattern that requires 777 rounds in the network for complete routing. This is the case, for example, of the identity permutation. Such patterns can be detected a priori and time can be allocated to perform the routing. Note also that in this network, routing a transpose permutation, which is usually a worst-case routing pattern for most networks, requires only one routing step.

Figure 5-7. Performance of the time-dilated network for (a) random traffic and (b) permutation traffic.

5.3 APPLICATIONS STUDIES


5.3.1  FFT  Studies 

One  of  the  important  application  areas  where  silicon  based  systems  remain  inadequate  is  the 
manipulation  and  processing  of  large  data  arrays  as  required,  for  example,  by  image  processing 
and  understanding  applications.  Although  many  parallel  algorithms  exist  in  this  domain,  their 
mapping onto existing system architectures usually leads to unsatisfactory performance when the
image  sizes  are  large.  This  is  because  the  computational  speed  of  the  algorithms  that  require  such 
manipulation  of  large  data  arrays  is  presently  limited  by  hardware  considerations  rather  than  by 
software  and/or  algorithm  mapping.  Thus,  new  hardware  solutions  capable  of  performing 
operations  such  as  histogram,  centroid,  moment,  segmentation,  and  FFT  calculations  at  high 
speeds  are  critically  needed  for  the  implementation  of  real  time  and  adaptive  signal  processing 
systems.  For  example,  if  such  operations  could  be  performed  on  1024x1024  data  arrays  with  32 
bit  accuracy  at  high  speed  video  rates,  significant  improvements  could  be  achieved  in  electronic 
warfare  and  industrial  control  applications. 

The  present  hardware  limitations  result  from  an  intricate  relationship  between  chip  area, 
allowable  power  dissipation,  and  the  number  of  channels  available  for  I/O,  memory  access,  and 
inter-chip  communication  as  well  as  their  respective  bandwidth.  Indeed,  as  the  size  of  the  data 
arrays increases, it becomes exceedingly difficult to pack the required processing units and
sufficient  memory  within  one  single  chip  to  carry  out  the  required  computation  at  high  speeds.  On 
the  other  hand,  the  use  of  parallel  hardware  with  many  chips  is  very  difficult  because  of  the 
bandwidth  required  by  I/O,  inter-chip  communication,  and  memory  access.  In  this  case,  the  delays 
and  power  required  to  support  the  required  communication  become  prohibitively  large. 

In  this  report,  we  focus  our  efforts  on  a  2-D  free-space  optoelectronic  FFT  engine  capable  of 
processing  1024x1024  or  larger  images.  The  selection  of  FFT  as  an  application  example  stems 
from  two  reasons.  First,  FFT  operations  are  essential  in  many  digital  image  processing 
applications  and  the  implementation  of  FFT  processors  has  been  studied  in  detail  by  the  electronics 
community.  Thus,  performance  benchmarks  are  readily  available  which  enable  us  to  identify  the 
benefits  of  the  proposed  optoelectronic  approaches.  Second,  many  systems  would  benefit 
significantly  from  faster  2-D  FFT  engines.  For  example,  faster  and  more  efficient  2-D  video 
compression  units  and  real  time  adaptive  noisy  image  recovery  systems  could  be  implemented. 

Section 5.3.1.1 provides a background on FFT algorithms and outlines the state of the art in current and projected FFT performance on all-electronic systems, including dedicated chip sets, more general DSP chips, and supercomputers. The roadblocks and limitations of these FFT systems are then outlined. Section 5.3.1.1 also describes chip-to-chip digital interconnection technologies such as MCM and free-space optical interconnects. In Section 5.3.1.2, we discuss two possible OTIS-based free-space optoelectronic 2-D FFT architectures that perform 1Kx1K-point FFTs. We also estimate the potential performance of the proposed architectures if we were to implement them with present state-of-the-art MCM technologies, and extend our analysis to future MCM technologies for the next five years. Then, we compare both optoelectronic approaches to their MCM-interconnect-based counterparts for 1Kx1K-point FFT operations. In all cases, we show that from a pure performance point of view, a 1Kx1K-point or larger FFT engine operating at 1,000 frames per second should be implemented with free-space optical interconnects. Finally, Section 5.3.1.3 presents some discussion and conclusions about the findings of this study.

5.3.1.1 Background

5.3.1.1.1 FFT Computation

The  Discrete  Fourier  Transform  (DFT),  which  expresses  a  discrete-valued  function  as  a 
superposition  of  a  discrete  set  of  frequencies,  has  applications  in  many  domains.  However,  a 
straightforward formulation of an N-point DFT requires N^2 steps to compute. Fast Fourier
Transform algorithms, such as those attributed to Cooley and Tukey,(5-8) can reduce this to
approximately N log N, a significant difference for even moderately large transform sizes.

One  way  to  generate  such  FFT  algorithms  is  to  begin  with  FFT  algorithms  for  fixed  sized 
transforms.  For  example,  a  2-point  FFT  algorithm  is: 

F(0) = f(0) + f(1)

F(1) = f(0) - f(1)    (5-17)

where  f  is  the  original  function  and  F  is  its  Fourier  Transform.  Such  an  algorithm  would  then  be 
used  as  a  building  block  in  a  larger  algorithm  using  a  mixed-radix  approach.  Intuitively,  applying 
this  2-point  transform  across  the  entire  data  can  be  thought  of  as  a  filter  which  divides  the  data  into 
low  and  high  frequency  components.  Another  stage  of  2-point  transforms  divides  each  of  those 
components  into  two  components  which  cover  narrower  frequency  windows.  By  repeating  this 
log N times, we obtain an N-point transform. Even more efficient algorithms can be obtained by
using  an  8-  or  16-point  transform  as  the  building  block.  The  resulting  algorithms  are  then  termed 
radix-8  or  -16  FFTs.  Some  additional  work  is  required  between  stages  in  order  to  implement  the 
intuitive  picture  described  above.  The  data  must  be  multiplied  by  a  constant  (often  called  a  “twiddle 
factor”)  and  the  data  array  must  be  rearranged.  This  rearrangement  poses  a  serious  problem  for 
parallel  implementation  of  FFTs.  Between  each  stage,  each  processor  will  be  required  to  exchange 
data  with  many  other  processors  in  order  to  accomplish  the  data  motion  needed  to  rearrange  the 
array.  Thus,  a  parallel  architecture  for  FFTs  must  be  able  to  support  such  communication. 
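The staged structure described above can be illustrated with a minimal radix-2 sketch in Python. This is an illustration only, not the algorithm used in the architectures of this report; the function names `dft` and `fft_radix2` are hypothetical. It shows the three ingredients named in the text: the 2-point transform of Eq. (5-17), the twiddle factor, and the data rearrangement (here an even/odd split).

```python
import cmath

def dft(f):
    """Direct N-point DFT: on the order of N^2 complex multiply-adds."""
    n = len(f)
    return [sum(f[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def fft_radix2(f):
    """Recursive radix-2 FFT; len(f) must be a power of two."""
    n = len(f)
    if n == 1:
        return list(f)
    even = fft_radix2(f[0::2])   # data rearrangement: even-indexed samples
    odd = fft_radix2(f[1::2])    # data rearrangement: odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n)   # the "twiddle factor"
        t = twiddle * odd[k]
        out[k] = even[k] + t             # 2-point transform, sum output
        out[k + n // 2] = even[k] - t    # 2-point transform, difference output
    return out

data = [1, 2, 3, 4, 0, 0, 0, 0]
max_err = max(abs(a - b) for a, b in zip(fft_radix2(data), dft(data)))
```

Both routines return the same transform up to rounding; only the operation count differs. In a parallel machine the even/odd split becomes data motion between processors, which is the communication problem discussed above.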
The FFT algorithms described above were for 1-D data. To perform transforms on
2-dimensional data (e.g. images), similar techniques are used. The 2-D data is partitioned into
sub-images; for example, each row can be considered as a sub-image. In this case, a 1-D FFT is
performed on each sub-image, using the methods outlined above. The data array is then
rearranged; in our example, each column would now become a sub-image. This operation requires a
transpose of the 2-D pattern. In a parallel FFT system, this can cause a serious communication
bottleneck, commonly termed the "corner turn". Finally, a second 1-dimensional FFT is
performed, producing the complete transform of the image. Note that the sub-image could very
well be a 2-D subsection of the original image, in which case the building block of the global 2-D
FFT is a smaller 2-D FFT. In any case, global communications are required in parallel FFT
systems.
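The row/column procedure can be sketched as follows, assuming a square image whose side is a power of two. The helper names (`fft`, `transpose`, `fft2`, `dft2`) are illustrative and not taken from the report; `transpose` plays the role of the corner turn.

```python
import cmath

def fft(x):
    """Radix-2 FFT of a list whose length is a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    t = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    return [even[k] + t[k] for k in range(n // 2)] + \
           [even[k] - t[k] for k in range(n // 2)]

def transpose(m):
    """Corner turn: rows become columns."""
    return [list(col) for col in zip(*m)]

def fft2(image):
    """1-D FFT on every row, corner turn, 1-D FFT again, then turn back."""
    after_rows = [fft(row) for row in image]
    after_cols = [fft(row) for row in transpose(after_rows)]
    return transpose(after_cols)

def dft2(image):
    """Direct 2-D DFT, used only as a reference for the check below."""
    n = len(image)
    return [[sum(image[r][c] * cmath.exp(-2j * cmath.pi * (u * r + v * c) / n)
                 for r in range(n) for c in range(n))
             for v in range(n)]
            for u in range(n)]

image = [[(r * 4 + c) % 7 for c in range(4)] for r in range(4)]
fast, slow = fft2(image), dft2(image)
max_err = max(abs(fast[u][v] - slow[u][v]) for u in range(4) for v in range(4))
```

On a single processor the transpose is a memory reshuffle; when rows are spread over many processors, every processor must exchange data with every other one, which is why the corner turn dominates the communication cost.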

5.3.1.1.2 Electronic FFT Systems

The  wide  applicability  of  the  FFT  has  led  to  many  implementations  of  the  algorithms  on  a  wide 
variety  of  hardware.  At  one  end  of  the  spectrum,  commercially  available  Digital  Signal  Processing 
(DSP)  chips  are  widely  used  in  small-scale  applications  which  require  FFT  computations.  Among 
the best available are the Analog Devices ADSP-21060 and the Texas Instruments TMS-320-C80,
capable of performing a 1K-point 1-D FFT in approximately 460 µsec and 160 µsec, respectively.
There also exist dedicated FFT chip-sets, usually consisting of a specialized FFT processor,
memory chips, and address generators. The Sharp LH9125/LH9320 and the Plessey PDSP16150
are examples of such systems, performing 1K-point 1-D FFTs in 87 µsec and 96 µsec,
respectively.(5-9) Certainly, technological advances in VLSI design will lead to faster and faster
systems. One proposed next-generation dedicated FFT chip would potentially offer 10 times the
performance of an existing FFT chip-set.(5-10) However, none of these chips has enough
computational power to provide 1Kx1K FFTs at a rate of 1,000 frames/second. For all of these
chips, the power consumption, area and cost are either known or can be projected. Table 5-2
summarizes the results for the computation of a 1K-point 1-D FFT and a 1Kx1K-point 2-D FFT.

The areas for the Analog Devices and Texas Instruments chips are estimates based on the fact
that these chips are engineered to operate without glue logic. For a large 2-D FFT, they would
actually require additional external memory chips. Area figures for the Plessey and Sharp chips
are for the complete chip-set, which includes the area required by the interface, address
generator, and memory chips. The speed estimate of the 2-D FFT computation is based on the
given speed of the 1-D FFT and the internal architecture of each chip, i.e. using an N log N speed
scaling to derive the 2-D FFT speed from the 1-D FFT numbers. We also neglect any overhead that
would result from external memory access. Thus, the actual computation times will be somewhat
larger for the 2-D FFT case, and the performance evaluation of these chips for 2-D FFTs is
optimistic.
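The N log N scaling can be made concrete with a short calculation. The sketch below assumes that an NxN-point transform costs (N^2) log(N^2) against N log N for an N-point transform, a ratio of 2N, and that "1K" is taken as N = 1000; both are assumptions made here for illustration, and the function names are hypothetical.

```python
import math

def ratio_2d_to_1d(n):
    """(n*n * log(n*n)) / (n * log(n)), which simplifies to 2n."""
    return (n * n * math.log2(n * n)) / (n * math.log2(n))

def estimate_2d_msec(t_1d_usec, n=1000):
    """2-D FFT time in msec from a 1-D FFT time in microseconds."""
    return t_1d_usec * ratio_2d_to_1d(n) / 1000.0

# 1-D times (microseconds) quoted in the text for the four existing chips.
one_d_usec = {"AD 21060": 460, "TI C80": 160, "Plessey": 96, "Sharp": 87}
two_d_msec = {chip: estimate_2d_msec(t) for chip, t in one_d_usec.items()}
```

Under these assumptions the estimate gives about 920, 320, 192 and 174 msec. Three of these agree with the entries of Table 5-2; the tabulated Texas Instruments figure (326 msec) is slightly higher than this simple scaling produces.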
TABLE 5-2. Performance Comparison of DSP and Dedicated FFT Chip-Sets.

Chip          Power   Cost    Area     Precision   Speed (µsec)   Speed (msec)
              (W)     ($)     (in^2)   (bits)      1K FFT         1Kx1K FFT
AD 21060      3.6     250     1        32 float.   460            920
TI C80        10      400     1        32 float.   160            326
Plessey       11      3,000   5        16 block    96             192
Sharp         30      4,000   18       24 block    87             174
Future Chip   6       N/A     2.8      24 block    8.3            16

Table 5-2 shows that special-purpose chips outperform the more general-purpose DSPs at the
expense of power consumption, system area and cost. From the table it is also clear that none of
the currently available chips can provide the kind of performance required for implementing a
1,000 frames-per-second 1Kx1K-point complex FFT. Even making use of a projected
next-generation chip and ignoring the cost of inter-processor communication, 16 chips would be
required to provide that level of performance. Alternatively, one could use 16 of these
next-generation chips in a pipeline architecture, and in this case, the 1,000 frames-per-second
performance could be achieved. This would only be possible at the cost of a very large latency:
16 msec, which is not adequate for adaptive systems.
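The chip count follows from dividing the per-frame computation time by the frame period. A minimal sketch, assuming ideal scaling with no communication overhead (as the text does) and using hypothetical function names:

```python
import math

def chips_required(frame_time_msec, frames_per_sec=1000):
    """Chips needed so that one frame completes per frame period."""
    period_msec = 1000.0 / frames_per_sec
    return math.ceil(frame_time_msec / period_msec)

def pipeline_latency_msec(stages, frames_per_sec=1000):
    """Latency of a pipeline in which each stage holds a frame for one period."""
    return stages * 1000.0 / frames_per_sec

future_chip = chips_required(16)     # projected next-generation chip, 16 msec per frame
sharp = chips_required(174)          # Sharp chip-set, 174 msec per frame
latency = pipeline_latency_msec(future_chip)
```

With the 16 msec figure of the projected chip this gives 16 chips and, in the pipelined arrangement, a 16 msec latency; the same arithmetic applied to the fastest existing chip-set gives 174 chips.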

To obtain a dramatic speed-up over the existing one-chip systems, it will be necessary to
combine many chips together in a parallel system. Indeed, FFT computation is an important use of
existing supercomputers. However, only large supercomputers have the computational power
necessary to solve large 2-D FFTs at high speed. Sample codes written by Aware Inc.(5-11)
provided the performance shown in Table 5-3.

Table 5-3 shows that even a supercomputer with relatively large parallelism cannot achieve the
desired performance. Note that other supercomputers (such as the Fujitsu) are reported to be
faster than the ones mentioned in that table; however, no 2-D FFT benchmark data was available
for the faster machines. In any case, the improvement will be far less than the order of magnitude
required to even approach the desired performance. Table 5-3 also shows that for systems with
multiple processors, doubling the number of processors does not double the speed of the system.
This is mostly due to the inefficiencies in the memory I/O and the communication between the
processors as their number increases.
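The diminishing return from doubling the node count can be quantified directly. The sketch below uses the Intel Paragon timings as read from Table 5-3 (2, 4, 8, 16 and 32 nodes at 818, 428, 225, 126 and 80 msec); since the table is partly illegible in this copy, these values should be treated as a best reading rather than verified data. The function name is hypothetical.

```python
# Intel Paragon 1Kx1K FFT times in msec, keyed by node count (as read from Table 5-3).
paragon_msec = {2: 818, 4: 428, 8: 225, 16: 126, 32: 80}

def doubling_speedups(timings):
    """Speed-up obtained each time the node count is doubled (ideal value: 2.0)."""
    nodes = sorted(timings)
    return {(a, b): timings[a] / timings[b]
            for a, b in zip(nodes, nodes[1:]) if b == 2 * a}

speedups = doubling_speedups(paragon_msec)
```

With these numbers each doubling yields a speed-up between about 1.9 and 1.6, falling as the machine grows, which is the behavior the paragraph above attributes to memory I/O and inter-processor communication.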

Thus, in order to offer a good technological solution to the problem of a 2-D 1Kx1K-point
complex FFT at 1,000 frames per second, the common roadblocks of all these FFT systems must
first be identified.
TABLE 5-3. Summary of 2-D Complex FFT Performance of Various Supercomputer
Configurations.

Intel Paragon                        IBM SP2
Number of nodes   1Kx1K FFT (msec)   Number of nodes   1Kx1K FFT (msec)
2                 818                [illegible]       [illegible]
4                 428                4                 280
8                 225                8                 150
16                126                16                80
32                80                 32                40
5.3.1.1.3 Present Roadblocks for Large-Scale FFT Systems

The  limitations  of  the  existing  and  future  electronic  solutions  to  the  FFT  problem  are 
summarized  in  this  section.  They  fall  into  four  broad  categories: 

•  For single-chip electronic solutions, computation, i.e. the speed of the Floating Point Units
(FPUs), is the main limiting factor.

•  As the computation problem is solved through parallelism, the speed of inter-processor
communication becomes critical. If, instead of using a fully interconnected system, one chooses
to utilize a highly pipelined architecture, then latency becomes a significant issue.

•  As the size of the images on which the FFT is performed increases, memory and FPUs will
not be on the same chip. In this case, high-speed processor-to-memory communication
might become a limitation.

•  As the required frame rate at which the FFT needs to be performed increases, I/O also
becomes a critical point. For example, a fast-frame system should be able to handle a
sustained 64 Gbits/sec I/O for a 1Kx1K-point, 1,000 frames-per-second, 32-bit-precision
complex FFT, and 1 Tbit/sec for 1,000 frames-per-second 4Kx4K-point FFTs.
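The I/O figures in the last item can be reproduced with a short calculation. The sketch assumes that each complex point carries two 32-bit components, that "1K" and "4K" are taken as 1000 and 4000, and that the rate is counted in one direction only; these are assumptions made here to match the quoted numbers, and the function name is hypothetical.

```python
def io_rate_bits_per_sec(points_per_side, frames_per_sec=1000, bits_per_component=32):
    """Sustained one-way I/O rate for square complex frames."""
    bits_per_point = 2 * bits_per_component      # real part + imaginary part
    return points_per_side ** 2 * bits_per_point * frames_per_sec

rate_1k = io_rate_bits_per_sec(1000)        # 1Kx1K with K taken as 1000
rate_4k = io_rate_bits_per_sec(4000)        # 4Kx4K with K taken as 1000
rate_1k_binary = io_rate_bits_per_sec(1024) # 1Kx1K with K taken as 1024
```

This gives 64 Gbits/sec and about 1 Tbit/sec respectively; taking K as 1024 instead raises the first figure to roughly 67 Gbits/sec.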

Thus,  a  parallel  system  architecture  based  on  the  combination  of  state-of-the-art  FPU  technologies 
and  high-speed  digital  interconnects  for  both  I/O  and  inter-chip  communication,  with  the 
appropriate  packaging  technology,  will  offer  the  best  solution  to  the  problem. 

5.3.1.1.4 Interconnection Technologies Modeling

Since the FFT problem is highly parallel, parallel implementation of FFT engines for large-scale
problems seems natural. However, such systems quickly become limited by the
interconnections. In order to evaluate these limitations and to identify the benefits of an
optoelectronic approach with respect to all-electronic solutions, we summarize here the state of
the art in present and near-future interconnection technologies, both electronic and
optoelectronic. In Subsection 5.3.1.2, we will use the interconnection models described here to
analyze the expected overall performance of the proposed optoelectronic FFT engines. The
modeling is based on work performed at UCSD, which includes optoelectronic devices such as
VCSELs and modulators, their associated receiver and transmitter circuits,(5-12) free-space
optical systems based on diffractive optics,(5-13) and on-chip and off-chip electronic
interconnections.(5-14) The latter will be particularly useful when we compare our
optoelectronic approach to a fully electrical approach based on MCM implementation. The
modeling work is based on present and projected parameters as listed in Table 5-4. It takes into
account the fact that processor chips are in general implemented with technologies that are one
or two generations behind DRAM technology. For 1995 parameters, we only considered
commercially available chips. However, we assumed state-of-the-art off-chip electrical
interconnections. In addition, we assumed that by the year 2000, new ceramic materials will be
available for MCMs, yielding a drop in the interconnect capacitance (per unit length) by a
factor of at least 3.

On-chip  interconnections 

On-chip electrical connections are important in both an all-electrical and an optoelectronic
implementation. For example, in the optoelectronic approach described in Section 5.3.1.2, there
will be a need to evaluate the performance of on-chip electrical connections. Thus, we have
calculated the power requirement for lines of various lengths. In this analysis, it is assumed that
optimally sized repeaters are used along the lines in order to minimize delay. In Table 5-5 and
Table 5-6, the power and energy are related linearly, which means that the power at 600 MHz is
approximated as simply 10 times the power at 60 MHz.
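The linear relation between energy and power is simply power = energy per transition times frequency. A minimal sketch with a hypothetical function name, using the 1 pJ and 0.4 pJ energies listed for 1.25 mm wires in Table 5-5 and Table 5-6:

```python
def wire_power_uW(energy_pJ, freq_MHz):
    """Power in microwatts: 1 pJ x 1 MHz = 1e-12 J x 1e6 /s = 1e-6 W."""
    return energy_pJ * freq_MHz

p_08um_60 = wire_power_uW(1.0, 60)     # 0.8 um CMOS, 1.25 mm wire, 60 MHz
p_08um_600 = wire_power_uW(1.0, 600)   # same wire at 600 MHz
p_035um_60 = wire_power_uW(0.4, 60)    # 0.35 um CMOS, 1.25 mm wire, 60 MHz
```

The results (60 µW, 600 µW and 24 µW) match the tabulated 60 MHz entries and show the factor of 10 between 60 MHz and 600 MHz.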

Off-Chip  Interconnections 

In Section 5.3.1.2.3, we compare our optoelectronic approaches to all-electrical approaches.
The highest-performance all-electrical packaging approaches to date and in the near future
remain MCM-type implementations. Table 5-4 lists the parameters and their values for the best
implementations reported in the literature to date, as well as 5-year projections. MCMs are
assumed to be copper wires on co-fired ceramic, designated as MCM-C. Using these parameters we
can derive the energy requirements and maximum operating frequency of an MCM connection. In
the derivations, we assume that the lines are lossless and that the cost of the vias in terms of
power and delay is negligible. Note that for lossless lines the maximum frequency of operation is
limited only by the impedance-matched driver of the line. All the lines are assumed to be series
terminated in order to minimize their power consumption. If parallel or active termination is
used, the maximum operating frequency of a line will increase by 20 to 30% at the cost of a
factor of 2 to 3 increase in power (this is the case for both today's and future technologies).
These assumptions lead to results that are quite optimistic for the MCM implementation. To give
some insight into the performance of such interconnections, results for typical line lengths are
summarized in Table 5-7.


TABLE 5-4. Today's and Future Parameters for MCM Implementations.

Parameter                                         Symbol    1995               2000
Chip spacing                                      D         5 mm               2 mm
Wire pitch                                        d         75 µm              25 µm
Number of layers                                  L         10                 40
Flip-chip bump capacitance                        Cpin      500 fF             100 fF
Via depth                                         Lvia      200 µm             100 µm
Propagation speed                                 v         15 x 10^9 cm/sec   18 x 10^9 cm/sec
Off-chip capacitance/unit length                  Cintoff   2 pF/cm            0.6 pF/cm
Off-chip resistance/unit length                   Rintoff   0.95 Ω/cm          0.5 Ω/cm
CMOS technology                                   2λ        0.8 µm             0.25 µm
Voltage supply                                    Vdd       [illegible]        3.3 V
Transistor threshold voltage                      Vth       0.85 V             0.6 V
Minimum transistor resistance                     Rmin      3.3 kΩ             5 kΩ
Minimum logic time constant                       RCmin     0.2 nsec           0.1 nsec
Minimum transistor transconductance parameter     Ktr       50 µA/V^2          80 µA/V^2
Minimum inverter input capacitance                Cmini     15 fF              10 fF
Minimum inverter output capacitance               Cmino     [illegible]        12 fF
On-chip capacitance/unit length                   Cinton    440 fF/cm          150 fF/cm
On-chip resistance/unit length                    Rinton    600 Ω/cm           1.5 kΩ/cm
TABLE 5-5. Power Consumption of On-Chip Wires for 0.8 µm CMOS.

Wire length (mm)   Energy (pJ)   RC limit   Power at 60 MHz (µW)
1.25               1             > 1 GHz    60
2.50               2             > 1 GHz    120
3.75               3             > 1 GHz    180

TABLE 5-6. Power Consumption of On-Chip Wires for 0.35 µm CMOS.

Wire length (mm)   Energy (pJ)   RC limit   Power at 60 MHz (µW)
1.25               0.4           > 1 GHz    24
2.50               0.8           > 1 GHz    48
3.75               1.2           > 1 GHz    72

TABLE 5-7. Performance of MCM Connections.

               1995                                    2000
Line length    Energy   Max. freq.   Power (mW)        Energy   Max. freq.   Power (mW)
(cm)           (pJ)     (MHz)        at 200 MHz        (pJ)     (MHz)        at 300 MHz
1              58       390          12                [illegible]
5              108      320          22                [illegible]
10             170      265          34                22.8     424          6.9
[illegible]    233      225          47                [illegible]

From  Table  5-7  we  can  see  that  the  expected  improvement  in  performance  for  MCM 
technology  over  the  next  five  years  is  significant.  At  a  given  line  length,  the  power  figures  are 
expected  to  drop  by  almost  an  order  of  magnitude  while  the  maximum  frequency  of  operation  will 
almost  double  for  lines  below  10  cm.  However,  for  large  area  MCMs  with  line  lengths  of  15  to  20 
cm  or  above,  the  maximum  frequency  of  operation  of  the  interconnection  will  remain  limited  to 
about  300  MHz. 
Optoelectronic interconnections

Here we provide an estimate of the power consumption of an optoelectronic link. The detectors
are assumed to be PIN diodes and the receivers are transimpedance circuits in which the number of
stages is optimized for the trade-off between electrical power and optical power. The
transmitters are either VCSELs or MQW absorption-based modulators. The driver circuits are based
on a super-buffer design,(5-15) which is a geometrically staged cascaded buffer circuit designed
to drive large capacitive loads with minimum signal delay. For MQWs, this design reduces to a
single large inverter due to the low capacitance of an MQW diode, typically 35 to 100 fF. In the
case of the VCSELs, the drivers must deliver relatively high currents, since a laser is modulated
between threshold and a high current value for maximum speed. The main drawback of both receiver
and driver designs in terms of performance is that they have a high DC power consumption (except
for the MQW drivers).

In Table 5-8, several types of transmitters and their associated circuits are evaluated in a
4096-channel system configuration and the corresponding power consumption per channel is
provided. The transmitters are VCSELs with 1 mA or 100 µA threshold current and 20x20 µm^2 MQW
modulators.(5-16) Depending on the architecture and implementation of the optical interconnect
system, different loss factors will be involved. Here we based our calculation on the
OTIS system, since it will be the interconnect system that is used in the optoelectronic approaches.

In Table 5-9, we project the performance of the optoelectronic link if 0.35 µm CMOS
technology is used for the driver and receiver circuits. In Table 5-8 and Table 5-9, the numbers in
parentheses in the receiver columns indicate the number of stages in the transimpedance receiver
circuit. The optical power column indicates how much laser power is delivered by the VCSELs or
how much optical power is required from the external laser for the MQW modulators. Finally, the
number in parentheses in the power-per-channel column indicates the total power required per
modulator, including its share of the power consumed by the external laser. In the following, we
will assume that either VCSELs or MQW modulators can be operated at 600 MHz. This yields a
link budget of about 2 mW and a total power consumption of about 9 W for 4096 channels. Note
that MQW modulator technology is readily available, while further development is required for
low-threshold VCSELs. The choice of transmitter technology will be decided based on scalability
issues, i.e. which technology will accommodate faster links, and on packaging issues. At present,
it appears that VCSELs may quickly become the better candidate, since their speed will not be as
limited as that of modulators and they do not require an external laser source. However, based
on the numbers in Table 5-7, Table 5-8, and Table 5-9, it is critical for the ultimate success of
FSOI to develop VCSELs with low threshold power, integrated in 2-D arrays and suitable for
flip-chip bonding.
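The per-channel figures for the VCSEL configurations can be cross-checked against the array totals. The sketch below assumes that the power per channel is the sum of receiver and transmitter power divided by the 4096 channels; this relation is inferred here from the tabulated numbers and is not stated in the report. Values are as read from Table 5-8 and Table 5-9, and the names are hypothetical.

```python
CHANNELS = 4096

# (configuration, receiver W, transmitter W, tabulated mW per channel)
vcsel_rows = [
    ("0.8 um CMOS, 1 mA threshold, 60 MHz",    8.7, 11.2,  4.9),
    ("0.8 um CMOS, 1 mA threshold, 600 MHz",  29.8, 18.6, 11.8),
    ("0.8 um CMOS, 100 uA threshold, 60 MHz",  8.7,  2.0,  2.6),
    ("0.8 um CMOS, 100 uA threshold, 600 MHz", 29.8, 9.4,  9.6),
    ("0.35 um CMOS, 100 uA threshold, 60 MHz",  2.0, 2.2,  1.0),
    ("0.35 um CMOS, 100 uA threshold, 600 MHz", 6.0, 3.6,  2.3),
]

def per_channel_mW(receiver_W, transmitter_W, channels=CHANNELS):
    """Electrical power per channel in milliwatts."""
    return (receiver_W + transmitter_W) / channels * 1000.0

# Largest difference between the computed and the tabulated per-channel power.
worst_mismatch_mW = max(abs(per_channel_mW(rx, tx) - listed)
                        for _, rx, tx, listed in vcsel_rows)
total_035um_600MHz_W = 6.0 + 3.6
```

Under this assumption every VCSEL row agrees with the table to within rounding, and the 0.35 µm, 600 MHz case totals 9.6 W, consistent with the "about 2 mW" link budget and "about 9 W" total quoted above.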
TABLE 5-8. Power Consumption of an Optoelectronic Link for 0.8 µm CMOS.

0.8 µm CMOS for        Receiver      Transmitter   Optical   Power per
4096 channels          Power         Power         Power     Channel
VCSELs 1 mA
  60 MHz               8.7 W (1)     11.2 W        0.5 W     4.9 mW
  600 MHz              29.8 W (3)    18.6 W        4.5 W     11.8 mW
VCSELs 100 µA
  60 MHz               8.7 W (1)     2.0 W         0.5 W     2.6 mW
  600 MHz              29.8 W (3)    9.4 W         4.5 W     9.6 mW
MQW, 5V, 20x20 µm
  60 MHz               8.7 W (1)     0.1 W         1.5 W     2.2 mW (3.3)
  600 MHz              38.0 W (4)    1.4 W         3.9 W     9.6 mW (12.5)

TABLE 5-9. Power Consumption of an Optoelectronic Link for 0.35 µm CMOS.

0.35 µm CMOS for       Receiver      Transmitter   Optical       Power per
4096 channels          Power         Power         Power         Channel
VCSELs 100 µA
  60 MHz               2.0 W (1)     2.2 W         [illegible]   1.0 mW
  600 MHz              6.0 W (3)     3.6 W         [illegible]   2.3 mW
MQW, 5V, 20x20 µm
  60 MHz               2.3 W (1)     0.1 W         1.0 W         0.5 mW (0.8)
  600 MHz              8.0 W (4)     0.8 W         4.0 W         [illegible]

If we compare the results of the link analysis, it becomes clear that optics offers significant
advantages in terms of performance. Note that even with projected ceramic materials the
interconnection speed of MCMs does not scale well. In fact, the situation is worse for electronics
than our optimistic projections predict, as demonstrated by the Semiconductor Industry
Association Roadmap, which predicts that off-chip interconnects will be limited to 250 MHz in the
year 2001. Optical interconnects can easily scale this number up to 600 MHz or more with
presently available devices. Optics can therefore match the predicted on-chip clock speeds, and
optically interconnected chip sets will be able to operate at single-chip speeds. In terms of power
dissipation, optics also offers an advantage. This is especially true considering that we
have neglected the performance degradation resulting from vias and corner turns in MCMs. Also, we
used presently available OE device parameters for the optoelectronic link power calculation, while
we used projections for the year 2000 for estimating the power consumed in electrical links. It
should also be noted that the optical interconnect link budgets were calculated based on the OTIS
interconnection architecture, which provides good interconnection power efficiency. We should
therefore caution that the above results cannot be generalized to any optical interconnection
scheme. Nevertheless, we can conclude that optical solutions offer a power-delay product per link
advantage exceeding a factor of 10 at 250 MHz. We will next describe the OTIS architecture and
discuss its usefulness in the context of FFT interconnects.

5.3.1.1.5 OTIS Approaches

In general, a 2-D implementation of multistage interconnection networks based on bounded
degree networks requires a transpose operation to be embedded in the circuit. The usefulness of
the transpose interconnection for global connectivity is therefore obvious in the context of a 2-D
FFT computation. The transpose operation, however, is challenging to implement in electronics
since it requires global data movements over long interconnection distances. On the other hand,
an optical implementation can be quite straightforward.

Figure 5-8 shows a case where OTIS is used for bit-serial communication. In this case NxN
PEs are arranged in √N x √N groups in the input plane. Each PE has a transmitter/receiver pair
and, via OTIS, each PE of a given group in the input plane is connected to one given PE in each
group of the output plane. Since this system is symmetrical with respect to the vertical axis in the
middle of the optical system, it can easily be folded onto itself by placing a mirror along that axis.
In this case, the input and output planes become the same, and global communication is
established on the PE plane.
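At the level of logical indices, the transpose connection can be sketched as a swap of group index and PE index. The sketch below is an illustration under that assumption: the index convention and function names are hypothetical, and any image inversion introduced by the physical optics is ignored. It uses 64 groups of 64 PEs, i.e. 4096 PEs in total.

```python
def otis_destination(group, pe):
    """The transmitter of PE `pe` in group `group` lands on PE `group` of group `pe`."""
    return (pe, group)

def build_links(n_groups):
    """All transmitter -> receiver links for n_groups groups of n_groups PEs."""
    return {(g, p): otis_destination(g, p)
            for g in range(n_groups) for p in range(n_groups)}

links = build_links(64)

# Every receiver is used exactly once, so the links form a permutation.
is_permutation = sorted(links.values()) == sorted(links.keys())
# Applying the mapping twice returns to the source, as expected of a transpose.
is_involution = all(links[links[src]] == src for src in links)
# The PEs of one group reach one PE in every group.
groups_reached_from_group_0 = {links[(0, p)][0] for p in range(64)}
```

The involution property is what allows the folded arrangement described above, where the same plane serves as both input and output.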

Alternatively, Figure 5-9 illustrates the case where OTIS is used for bit-parallel
communication. In this case, each PE in plane 1 on the left has a transmitter/receiver pair for each
bit of its parallel output. OTIS optically maps the bit outputs of one PE to different groups in
plane 2 on the right. If each group on the right is a crossbar-equivalent electronic switch, then all
the bits can be routed identically by these switches, as illustrated in Figure 5-9. OTIS maps the
switched bits back onto plane 1 on the left, and all the bits end up on the detector inputs of a
single PE. Note that the switches in plane 2 on the right are local and very simple, and can all be
controlled via a single centralized controller since they all switch the same way at any given
time.(5-17) This system provides full routing (crossbar-equivalent switching) in a two-pass,
single-stage optoelectronic system.

Figure 5-8. OTIS for a bit-serial connection between PEs.

Figure 5-9. OTIS for a bit-parallel communication between PEs.

5.3.1.2 Optoelectronic Approach

In  this  section,  we  first  describe  two  optoelectronic  system  architectures  that  remove  the 
roadblocks  and  limitations  outlined  previously.  We  evaluate  their  potential  performance  and 
compare  them  to  all  electrical  approaches  based  on  MCM  technology. 

5.3.1.2.1 Optoelectronic Systems Configurations and Performance

The  optoelectronic  approaches  for  a  parallel  FFT  machine  are  based  on  optimally  combining 
electronics  technology  with  free-space  optical  interconnection  technology.  In  order  to  select  a  good 
potential  parallel  optoelectronic  implementation  of  an  FFT  machine,  we  have  analyzed  several 
possible  implementations  in  terms  of  performance  over  cost.  Namely,  we  have  looked  at 
throughput  versus  power  consumption  for  all  the  operations  involved  in  the  FFT  computation  for 
various  technologies  and  system  implementations.  These  operations  include  the  data  input,  the 
memory  to  processor  connections,  the  processor  to  processor  connections,  the  processor 
computations,  and  the  data  output. 

In the following, we show the performance and cost analysis of two different optoelectronic
systems for FFT computations. The first one is based on custom silicon chips interconnected via
the Optical Transpose Interconnection System (OTIS). In this case, I/O and
communication to the chips are performed in a Word-Parallel/Bit-Serial fashion. The second
system uses Commercial Off-The-Shelf (COTS) chips, which are also interconnected via an
OTIS-based system. In this case, however, I/O and communication between chips are performed in a
Word-Parallel/Bit-Parallel way.

In  the  following,  the  two  systems  are  evaluated  for  performing  1024x1024  2-D  complex  to 
complex  FFTs  on  32  bit  word  precision.  It  is  assumed  that  light  transmitters  and  detectors  are  flip- 
chip  bonded  to  silicon  and  are  interconnected  using  a  free-space  optical  system.  In  our  calculations, 
we  have  assumed  that  the  optical  interconnect  system  supports  4096  channels  and  consists  of  two 
planes  of  diffractive  lenslet  arrays  and  the  additional  optics  required  to  illuminate  Multiple  Quantum 
Well  (MQW)  modulators.  Although  for  this  report  we  have  assumed  that  the  system  utilizes  MQW 
modulators,  it  can  be  equally  well  implemented  using  optical  sources  such  as  Vertical  Cavity 
Surface  emitting  Lasers  (VCSELs)  flip-chip  bonded  to  the  chips,  in  which  case  the  optical  system 
may  not  require  an  external  laser  and  additional  illumination  optics. 

5.3.1.2.1.1 Bit-Serial / Custom-Electronics approach

In  the  Bit-Serial  approach,  an  electronics  plane  consists  of  an  8x8  array  of  custom  chips  where 
each  chip  consists  of  an  8x8  array  of  Processing  Elements  (PEs)  for  a  total  of  64x64  (4096)  PEs. 
Each  PE  consists  of  an  FPU  that  performs  the  required  complex  multiplications  and  additions. 
Each  PE  also  contains  the  necessary  memory  to  store  the  data  and  program  required  for  performing 
the FFT. The amount of memory required per PE to map a 1024x1024 FFT onto the 64x64 PEs is
16 Kbits for data storage, and performing the FFT algorithm will require an additional 16 Kbits
for program/workspace. Also, each PE has an optical Transmitter/Receiver pair and six 32-bit wide
on-chip electrical channels for communicating with other PEs. The on-chip connections form a
hypercube among the 64 PEs of each chip. Figure 5-10 shows a schematic of this approach.
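The 16 Kbit data-storage figure can be checked with a short calculation. The sketch below assumes that the 1024x1024 data set is divided evenly over the 64x64 PEs and that each value is a 64-bit complex word (32-bit real part plus 32-bit imaginary part); the function name is illustrative only.

```python
def data_memory_bits(fft_size=1024, pe_grid=64, word_bits=64):
    """Data storage needed per PE, in bits, for an evenly divided 2-D data set."""
    side = fft_size // pe_grid          # each PE holds a side x side tile (16 x 16)
    values_per_pe = side * side         # 256 complex values per PE
    return values_per_pe * word_bits    # 64 bits per complex value


bits = data_memory_bits()
print(bits, "bits =", bits // 1024, "Kbits")
```

Under these assumptions each PE holds 256 complex values, which matches the 16x16 starting array per PE used later in the FFT mapping.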


Figure  5-10.  Bit-Serial  /  Custom  Electronic  chips  approach 


We  first  designed  a  64-bit  FPU  optimized  for  computing  higher-radix  butterflies  for  the 
optoelectronic FFT architecture. The design was laid out for 0.8 and 0.5 micron CMOS
technologies  available  from  MOSIS.  We  also  estimated  the  performance  of  the  design  with  a  0.35 
micron  CMOS  process.  Table  5-10  summarizes  these  results.  The  FFT  Processing  Unit  (FPU)  is 
a  computational  engine  that  performs  the  complex  mathematical  operations  required  for  the 
calculation  of  an  FFT.  Broken  down  into  primitive  operations,  the  FFT  is  just  a  series  of 
multiplications,  additions,  and  subtractions.  The  FPU  can  perform  the  multiplication  of  two 
complex  numbers,  as  well  as  the  addition  and  subtraction  of  two  complex  numbers.  Furthermore, 
it  can  mix  the  operations  of  addition  and  multiplication  or  subtraction  and  multiplication.  Finally, 
building on the operations mentioned above, it can achieve a multiply-accumulate function which
is  heavily  used  in  higher-order  radix  FFT  implementations. 

The  FPU  is  composed  of  a  data  storage  unit  and  an  arithmetic  unit.  The  data  storage  unit  is  a 
512x64  high-speed  SRAM.  This  is  used  to  store  the  initial  data  values,  and  the  intermediate 
calculation  results  as  necessary.  The  arithmetic  unit  contains  a  complex  multiplier,  a  complex 
adder/subtracter,  and  some  scratchpad  storage  registers.  All  of  these  components  are  connected  by 
a 64-bit data bus which represents the size of each data word. Each data word is a complex number
with a 32-bit real part and a 32-bit imaginary part. The adder/subtracter unit is a block that is
composed  of  a  64-bit  high-speed  adder,  a  column  of  inverters,  and  a  multiplexer.  Two  complex 
numbers  are  added  when  the  multiplexer  is  set  to  pass  the  non-inverted  data  straight  to  the  adder 
input.  In  a  binary  system,  subtraction  can  be  achieved  by  adding  an  unchanged  number  to  the  2’s 
complement  of  another  number.  Subtraction  is  performed  in  the  adder/subtracter  unit  when  the  data 
to input A is inverted and the Carry In bit is set to ‘1’ to form a 2’s complement addition. During
normal addition, the Carry In bit is set to ‘0’. The sum or difference is then stored in a register.
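The inverter, multiplexer, and Carry In arrangement described above can be illustrated in software. The following is a minimal sketch, not the actual circuit: it models a single 32-bit lane (one real or one imaginary part), and the names `add_sub_lane` and `to_signed` are hypothetical.

```python
MASK32 = 0xFFFFFFFF  # one 32-bit lane (real or imaginary part)


def add_sub_lane(a, b, subtract):
    """Return (b + a) or (b - a) modulo 2**32 using a single adder.

    For subtraction, operand A goes through the inverters (bitwise NOT)
    and the carry-in is forced to 1, which adds the 2's complement of A.
    """
    a_in = (~a & MASK32) if subtract else (a & MASK32)  # multiplexer choice
    carry_in = 1 if subtract else 0
    return (b + a_in + carry_in) & MASK32


def to_signed(x):
    """Interpret a 32-bit pattern as a signed 2's complement integer."""
    return x - (1 << 32) if x & 0x80000000 else x


print(add_sub_lane(5, 12, subtract=False))            # sum of 12 and 5
print(to_signed(add_sub_lane(5, 12, subtract=True)))  # 12 minus 5
print(to_signed(add_sub_lane(12, 5, subtract=True)))  # 5 minus 12
```

A complex addition or subtraction would apply the same operation independently to the real and imaginary lanes.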


TABLE 5-10. Design Parameters of the FPU and Memory for the Bit-Serial/Custom-Chip
Optoelectronic Approach.

The complex multiplier decomposes the multiplication of two complex numbers as follows:

PROD = (Br + jBi)*(Wr + jWi)

PROD = (Br*Wr - Bi*Wi) + j(Br*Wi + Bi*Wr)  (5-18)

Therefore,  the  multiplier  unit  consists  of  four  32-bit  x  16-bit  multipliers  and  two  32-bit  adders. 
The  output  of  the  multiplier  unit  is  a  64-bit  combined  real  and  imaginary  number. 
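The decomposition in Equation (5-18) maps directly onto four real multiplications and two real additions. The sketch below shows only that arithmetic structure; it uses unbounded Python integers and does not model the 32-bit x 16-bit operand widths, and the function name is illustrative.

```python
def complex_multiply(br, bi, wr, wi):
    """Multiply (br + j*bi) by (wr + j*wi) as in Equation (5-18)."""
    # Four real multiplications, one per multiplier block.
    p_rr = br * wr
    p_ii = bi * wi
    p_ri = br * wi
    p_ir = bi * wr
    # Two real adders: one forms the difference, the other the sum.
    real = p_rr - p_ii
    imag = p_ri + p_ir
    return real, imag


print(complex_multiply(3, 4, 1, 2))
```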
Additional technology assumptions were made in order to evaluate this optoelectronic
approach:

An on-chip wire needs to be at least 4λ wide and spaced apart by 3λ, where λ is half the
CMOS feature size.

Thus, for a CMOS feature size of 0.35 µm, each wire and associated spacing occupies a width of
about 1 µm. Because the PEs are laid out as a hypercube, each PE is connected to log2 N other
PEs (six for the N = 64 PEs of a chip). This yields a total connection length requirement per PE
of 2 x N^(1/2) x A, where A is the PE spacing. Each connection is 32 bits wide, and each of these
32 wires is bi-directional. Each wire requires electrical transmitter and receiver circuits. Each PE
also requires an optical transmitter/receiver pair and their associated circuits operating at
600 MHz. The area and energy requirements of these components are given in Table 5-11. Note
that these values are only approximations but provide insight about the technology capabilities
and trade-offs. Under these assumptions, one of the custom chips (8x8 PEs) is estimated to occupy
a die of about 1 cm2. For a 60 MHz clock speed this yields about 5 W power consumption and
5 W/cm2 power dissipation, while for 600 MHz these numbers become about 50 W and 50 W/cm2
respectively. These figures are dominated by the FPU and electrical wires, which account for over
90% of both power and area requirements. For the entire system, which consists of 64 such chips,
the power consumption becomes 320 W for a 60 MHz clock and 3.2 kW for a 600 MHz clock.
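The wire pitch, the number of hypercube links per PE, and the system power quoted above follow from simple arithmetic. The sketch below reproduces those figures under the stated assumptions (wire width 4λ, spacing 3λ, λ equal to half the feature size, 64 PEs per chip, 64 chips); the function names are illustrative and the per-chip power values are taken from the text rather than derived.

```python
import math


def wire_pitch_um(feature_um):
    """Width plus spacing of one on-chip wire: 4*lambda + 3*lambda."""
    lam = feature_um / 2.0          # lambda is half the CMOS feature size
    return (4 + 3) * lam


def hypercube_links(pes_per_chip):
    """Number of neighbours of each PE in a hypercube of pes_per_chip nodes."""
    return int(math.log2(pes_per_chip))


def system_power_w(chip_power_w, chips=64):
    """Total power for a plane of identical chips."""
    return chip_power_w * chips


print(wire_pitch_um(0.35))      # pitch in micrometres, roughly 1 um
print(hypercube_links(64))      # links per PE on one chip
print(system_power_w(5), system_power_w(50))  # watts at 60 MHz and 600 MHz
```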

TABLE 5-11. Summary of Area and Power Requirements of the Various System Components.

The FFT is computed using 8- and 16-point FFT building blocks used in row-column fashion
and combined in a mixed-radix algorithm. The idea here is to use large-radix building blocks, as
they involve many fewer multiplications at the cost of more additions. For a 1024x1024 FFT on a
64x64 PE system, each group of PEs on a chip has a 128x128 array of values. Each PE starts
with a 16x16 array of these values. Using the radix-16 FFT block, it computes the FFT of its
values using a row-column approach, which we define as a “16-point main block.” Then, it
redistributes its data across the processors in its group. In the next step, each processor's 256
values can be thought of as four 8x8 blocks, and 8-point transforms, called “8-point main blocks,”
are used on them (this is because 128 = 16 x 8). At this point, a 128x128 2-D FFT has been
performed on the values in each group. Then, the optoelectronic OTIS-based shuffle performs the
next reorganization, a transpose of the data, and the process is repeated.

The time required to perform a radix-16 FFT block can then be calculated. It requires 148
additions and 20 multiplications, which is equivalent to 1740 clock steps or 29 µsec at 60 MHz and
2.9 µsec at 600 MHz. The time for a 16-point main block is then 16x29 = 464 µsec or 46.4 µsec
respectively.  Under  these  assumptions,  the  run-time  of  a  1024x1024  complex  to  complex  valued 
FFT  as  described  in  the  previous  paragraph  can  be  estimated.  At  60  MHz,  we  find  that  the  total 
time  is  about  4.5  msec  with  4.1  msec  compute  time,  0.3  msec  electrical  connect  time,  and  about 
0.1  msec  optical  connect  time  where  compute  time  is  consumed  only  in  the  FPUs,  electrical 
connect  time  is  spent  communicating  on-chip,  and  optical  connect  time  is  spent  for  I/O  and 
communications  between  chips.  At  600  MHz,  the  total  time  becomes  0.5  msec  with  a  breakdown 
of  0.41,  0.03,  and  0.11  msec  for  compute,  electrical  connect,  and  optical  connect  respectively. 
Thus,  by  using  optics,  the  interconnect  times  have  become  negligible  and  the  overall  speed  remains 
limited  by  the  electronic  computations  in  the  FPU.  This  is  an  important  conclusion  that  indicates 
that  this  approach  will  remain  viable  even  with  improving  electronics.  The  results  of  the 
Bit-Serial/Custom-Electronics  approach  are  summarized  in  Table  5-12. 
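The radix-16 block timing can be reproduced from the operation counts. The sketch below assumes 5 clock cycles per complex addition and 50 per complex multiplication, the FPU figures stated later in this section; the function names are illustrative.

```python
RADIX16_ADDS = 148   # complex additions in one radix-16 block
RADIX16_MULS = 20    # complex multiplications in one radix-16 block


def radix16_cycles(add_cycles=5, mul_cycles=50):
    """Clock steps needed for one radix-16 FFT block."""
    return RADIX16_ADDS * add_cycles + RADIX16_MULS * mul_cycles


def cycles_to_usec(cycles, clock_mhz):
    """Convert a cycle count to microseconds at the given clock rate."""
    return cycles / clock_mhz


def main_block_usec(clock_mhz, blocks=16):
    """Time for a 16-point main block, counted as 16 radix-16 blocks."""
    return blocks * cycles_to_usec(radix16_cycles(), clock_mhz)


for mhz in (60, 600):
    print(mhz, "MHz:", cycles_to_usec(radix16_cycles(), mhz), "usec per block,",
          main_block_usec(mhz), "usec per main block")
```

Passing `add_cycles=1, mul_cycles=10` to `radix16_cycles` gives the cycle count for the faster FPU considered later in the section.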

TABLE 5-12. Performance Results for the Bit-Serial/Custom Chips Optoelectronic FFT
Approach.

Important conclusions can be derived from the evaluation of this configuration:

•  This  configuration  allows  1,000  frames  per  second  with  no  latency  (assuming  about  300  MHz 
clock  rate).  However,  at  high  speeds,  the  required  power  is  large,  consumed  mainly  by  the 
FPUs  and  on-chip  electrical  connections.  Note  that  the  power  required  for  the  optical 
interconnects  is  essentially  negligible  compared  to  the  other  system  components  at  high  speed. 

•  The  speed  of  the  electronics  (FPUs)  is  the  main  limiting  factor  in  the  system.  Thus,  better  ways 
of  implementing  the  FPUs  must  be  developed.  However,  as  can  be  seen  from  Table  5-12  the 
power  densities  reached  for  fast  silicon  chips  become  rapidly  unmanageable.  Thus,  faster 
technologies  with  good  power  delay  products  should  be  targeted:  low-voltage  CMOS  or  GaAs 
complementary  FET  technology  (if  made  practical)  would  be  good  candidates  at  this  point. 

•  Optics  is  essentially  free  in  terms  of  power  and  performance. 

In  the  previous  estimates,  one  of  the  assumptions  was  that  an  FPU  can  perform  a  complex 
addition  in  5  clock  cycles  and  a  complex  multiplication  in  50.  To  further  improve  the  performance 
of the Bit-Serial approach, let us consider a case where the FPU can perform these operations in 1
and 10 clock cycles respectively. This is a difficult assumption to justify since such FPUs tend to
be very large in area and would lead to a Bit-Serial/Custom chips implementation that might not be
realistic. In this case, the total time for performing a 1Kx1K FFT becomes 1 msec (60 MHz clock)
and 0.16 msec (600 MHz clock). Therefore, the performance goals can be achieved
even for a fairly low system clock speed.

At  this  point,  it  should  be  emphasized  that  the  optoelectronic  interconnect  technology  does 
not  solve  all  aspects  of  the  parallel  FFT  implementation  problem  but  only  removes  the  interconnect 
roadblock  making  room  for  aggressive  electronic  VLSI  technologies. 

5.3.1.2.1.2 Bit-Parallel / Commercial-Off-The-Shelf chips OE approach

In this approach, the electronics plane consists of an 8x8 array of Commercial-Off-The-Shelf
(COTS)  DSP  or  FFT  chips.  Also,  a  switching  plane  of  8x8  routing  chips  will  be  used,  and  free- 
space  optics  will  be  utilized  to  connect  the  two  planes.  The  switching  plane  will  consist  of  either 
OTIS  hypercubes,  crossbar  chips,  or  cyclic  shift  circuitry.  The  OTIS  system  that  connects  the 
processing  plane  to  the  switching  plane  will  be  used  in  a  Bit-Parallel  fashion  for  interconnections 
(Figure 5-11). Each processor chip is now given a 128x128 sub-image (= 16K word-pairs) of the
image.

[Figure 5-11: 64 COTS chips; Transmissive OTIS System; Switch Plane]


Two assumptions need to be made in order to evaluate the performance of such a system. First, it
is assumed that an arbitrary N point 2-D FFT can be computed as fast as an N point 1-D FFT on
the processing chip. It is also assumed that the time required for an N point FFT scales as N log N
as N grows. Second, it is assumed that the routing in the switch plane can be performed at the
same speed as the processing chip clock. This last assumption is fair since DSP or dedicated FFT
chips have reasonable clock speeds and switch planes operating at speeds of 200 MHz have
already been demonstrated.(5-17) In this second optoelectronic approach we consider several DSP
and FFT chips for which the performance is described in the literature.(5-9, 5-18)

To perform the 2-D 1Kx1K FFT, data is first loaded on the 64 chips, where each chip receives
a 128x128 data block. The FFTs of the 128x128 data blocks are computed locally on each of the
chips. For the chips we consider, this is done in the time required for computing the equivalent 32
1K-point 1-D FFTs. Data is redistributed among the chips by using the Bit-Parallel OTIS system.
Another set of local FFTs on 128x128 data blocks is then performed and the data is finally output.
The chips we consider are the same as described in Section 5.3.1.1.2. The first two have enough
on-chip  memory  to  store  the  required  128x128  data  block  while  for  the  others,  external  memory 
chip(s)  will  be  required.  Note  that  in  the  case  of  the  Plessey  system  these  memory  chips  are 
included  in  the  chip-set.  We  assume  that  the  cost  in  terms  of  performance  for  adding  these  memory 
chips  has  little  effect.  Indeed,  with  the  rapid  progress  in  DRAM  and  SRAM  technologies  and  by 
utilizing  parallel  RAM  banks  in  a  pipelined  fashion  the  access  time  to  these  devices  can  be 
essentially neglected. In any case, we will make the same assumption later when evaluating the
potential performance of an MCM based implementation. Table 5-13 summarizes the results of the
Bit-Parallel/COTS approach.


TABLE 5-13. Performance Results for the Bit-Parallel/COTS Chips Optoelectronic FFT
Approach.

As was the case in the previous optoelectronic approach, the systems based on existing chips
are largely dominated by the computation time and power requirements. It can also be
observed that performance can be increased by using more computationally efficient chips at the
expense of system power and system cost. One sure way to achieve the 1Kx1K 1,000 frames per
second performance is to utilize future FFT chips specially designed for providing orders of
magnitude improvements in computation for reasonable power figures. Today, such chips may be
impractical; however, with evolving feature sizes, they may become viable within the next five
years. Thus it is critical to start investigating means by which the interconnect roadblocks can be
removed in systems that use these chips. Another important factor to consider is that when using
COTS chips the I/O to and from the chips is limited due to power considerations, as they are
designed to be assembled as single chips in standard packages and must drive large I/O pins. For
example, the existing chips have an I/O bandwidth of 40 MHz, 32 bits wide, while it is projected
that the future generation FFT chip I/O bandwidth will be 150 MHz, 64 bits wide. This explains
why in Table 5-14 the Compute, I/O and Connect times are essentially of the same order with the
future generation FFT chip. In contrast, in the Bit-Serial/Custom-Electronics design the I/O
could be tailored to the optics capabilities and did not limit the system’s performance.

5.3.1.2.2 Why Not More Optics?

One  of  the  conclusions  of  the  system  analysis  previously  presented  is  that  the  optoelectronic 
technology  allows  fast  inter-processor  communication  at  essentially  no  cost  in  terms  of  power. 
Based  on  this  conclusion,  one  could  think  that  the  Bit-Serial/Custom-Electronics  system  could  be 
designed  to  use  more  optoelectronics  and  that  the  overall  performance  could  therefore  be  improved. 
However,  by  utilizing  parallel  electrical  paths  between  the  PEs  on  a  chip,  rather  than  serial  optical 
links,  the  system  speed  is  optimized  at  the  expense  of  a  larger  electrical  power  consumption.  If 
serial  optical  links  were  used  instead,  the  power  consumption  of  the  connections  could  be  reduced 
but  this  would  occur  at  the  expense  of  system  speed.  Since  the  main  source  of  power  consumption 
in  the  system  is  due  to  the  FPUs,  this  would  not  bring  any  significant  power  savings  but  would 
slow  down  the  system.  Alternatively,  more  optical  links  could  be  used  to  keep  the  inter-PE 
connections on a chip parallel. In this case, the cost in terms of numbers of transmitters and
receivers would jump to large values (over 750,000: 6 connections per PE, each 32 bits wide, for
64x64  PEs)  and  the  additional  complexity  of  the  optics  and  related  reliability  issues  would 
essentially  make  the  system  impractical. 

Alternatively,  a  system  design  that  uses  more  serial  optical  links  between  more  optoelectronic 
PEs  could  be  envisioned.  This  approach  would  essentially  allow  us  to  remove  all  electrical 
connections  between  PEs  and  reduce  the  memory  requirements  per  PE.  However,  as  the  size  of  the 
FFT problem grows, this hardware approach would rapidly become unmanageable. For example,
assuming that an FPU area is 1 mm2, the total silicon area required would be 1 m2 for the
1024x1024 FFT, which is very unrealistic. The number of receivers and transmitters would also
grow  and  create  problems  of  its  own  in  terms  of  packaging,  available  optical  power  and  optical 
system.  Thus,  it  seems  that  the  approaches  described  above  represent  a  good  compromise  in  terms 
of  hardware  complexity  and  feasibility  vs.  performance  and  power  consumption. 
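The two scaling arguments in this subsection reduce to simple counts. The sketch below assumes, for the first figure, one optical link per wire of every on-chip hypercube connection, and for the second, one FPU of 1 mm2 per FFT point (our reading of the fully optical alternative, not a figure stated explicitly above); the function names are illustrative.

```python
def parallel_optical_links(pe_grid=64, links_per_pe=6, bits_per_link=32):
    """Transmitter/receiver count if every on-chip wire became an optical link."""
    return pe_grid * pe_grid * links_per_pe * bits_per_link


def silicon_area_m2(fft_size=1024, fpu_area_mm2=1.0):
    """Total FPU area if each FFT point had its own FPU (assumption)."""
    points = fft_size * fft_size
    return points * fpu_area_mm2 / 1.0e6   # 1 m2 = 1e6 mm2


print(parallel_optical_links())   # count of optical links
print(silicon_area_m2())          # area in square metres
```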

5.3.1.2.3 Why Not More Electronics?

One way to justify that optoelectronic systems are worth developing is to compare them to
a potential all-electrical implementation based on MCM technology. Both the Bit-Serial/Custom-
Electronics and the Bit-Parallel/COTS chips options could be built with MCM technology. The
required  wire  lengths  on  the  MCM  can  be  calculated  based  on  the  chip  areas  and  their  minimum 
spacing  on  the  MCM  using  Manhattan  distances.  Then,  their  power  consumption  can  be  derived 
for  existing  and  projected  MCM  technologies.  For  each  one  of  these  implementations  there  is  a 
minimum  FFT  size  and  clock  rate  that  justifies  the  use  of  optoelectronics  as  the  off-chip 
interconnect  technology  of  choice. 

5.3.1.2.3.1 Bit-Serial/Custom-Electronics approach

In this case we use the same FPU chips as described before. The chips are about 1 cm2 and
each chip contains 64 PEs; the chips are built using 0.35 µm technology. Each chip has a serial
connection to all the other chips, which yields 4096 total connections. In addition, power, ground,
and various control signals need to be routed to all the chips. Note that, as opposed to the optics
case  the  I/O  of  data  to  the  chips  might  be  quite  difficult  to  implement  since  it  will  require  additional 
demultiplexer  circuits  to  route  the  correct  image  data  to  the  chips.  Based  on  the  parameters 
described  before  and  omitting  the  required  demultiplexing  circuits  for  I/O,  the  following  MCM 
performance  can  be  estimated  (see  Table  5-14).  All  the  wire  lengths  are  calculated  based  on  the 
chip  areas  and  their  minimum  spacing  on  the  MCM.  Then,  their  power  consumption  can  be  derived 
for  existing  and  projected  MCM  technologies. 

TABLE  5-14.  Performance  of  the  Bit-Serial/Custom-Electronics  MCM  Implementation. 


From Table 5-14, it is clear that an all-electrical system based on an MCM implementation of
custom chips behaves very similarly to the optoelectronic implementation described previously.
The projected numbers (right column in Table 5-14) are actually almost identical to those of the
optoelectronic implementation.

The factors of choice in this case are, however, fairly clear. First, it can be argued that the
feasibility of such a large MCM is questionable. Remember that the area quoted in the table is only
for the FPU chips and does not include the I/O chips and unavoidable glue logic that will be
required in the system, making the area grow by a significant amount to at least 12x12 cm2. Second
and most important, the MCM interconnections are frequency limited to 300 MHz, even with a five
year  projection.  This  limitation  is  in  effect  due  to  the  drivers  which  must  be  impedance  matched  to 
the  off-chip  wire.  This  puts  an  upper  bound  on  how  fast  such  a  system  could  run  when  FPUs 
become  more  and  more  efficient,  since  in  this  case,  the  system  will  become  I/O  and  interconnect 
limited.  Note  that  in  the  case  of  VCSEL  based  optoelectronic  implementation,  this  limit  is  much 
higher. 

5.3.1.2.3.2 Bit-Parallel / COTS-chips approach

To simplify the comparison for this approach we only evaluate the implementation of the
system with future generation FFT chips and five-year projected MCM performance. In this case,
the  system  requires  64  FFT  chips  which  are  interconnected  to  switches  in  order  to  establish  global 
communications  between  all  the  FPU  chips.  We  assume  that  the  chips  are  identical  to  those  used  in 
the  optoelectronic  modeling.  Each  FFT  chip  is  assumed  to  have  a  32  bit  wide  I/O  port  operating  at 
a  maximum  speed  of  150  MHz  which  is  the  projected  number  for  such  chips.  In  addition, 
switching  chips  are  required  so  that  global  communications  can  be  established.  We  assume  that 
four 64-input/64-output crossbars can be implemented on a single chip utilizing the same power
as the switching plane in the optoelectronic approach. We also assume that each FFT chip requires
an additional 1 Mbit memory chip to store the required 64 Mbits of data of a 1024x1024 image
along with the required control logic. This yields an MCM that must accommodate 132 chips with
4096 global (FFT chips to crossbars) and 2048 local (FFT chip to memory) wires. Based on the
future generation FFT and MCM projection parameters, the performance of the system can be
calculated  (see  Table  5-15). 


TABLE 5-15. Performance of the Bit-Parallel/COTS-chip MCM Implementation.

1Kx1K FFT                                            MCM
Area (cm2)                                           14.2 x 15.4
Number of wires per plane                            > 200
Off-chip maximum frequency of longest line (MHz)     250
Off-chip Connect and I/O Time (msec)                 0.44
Off-chip Connect Power (W)                           7
Total Time (msec)                                    0.70
Total Power (W)                                      583

As was the case for the Bit-Serial approach, the numbers indicate that this MCM
implementation is very close in performance to the corresponding optoelectronic approach. The
main difference is still in the interconnections' maximum operating frequency, which will be limited
to about 250 MHz on the MCM. However, this has no bearing on the Bit-Parallel/COTS-chip
approach that uses the next generation FFT chip since the I/O of that chip is only 150 MHz.

5.3.1.2.3.3 Comparison Summary

Based on the previous discussion, we can conclude that for a 64x64 transpose and clock rates in
excess of 200 MHz (presently) and 300 MHz (in 2000), it becomes interesting to consider an
optoelectronic implementation. If a larger transpose or higher clock speeds are desired, the only
foreseeable implementation technology remains optoelectronic interconnects. Beyond 300
MHz clock rates and 1Kx1K image sizes, additional limitations are imposed on the memory and
system I/O that might also require optical interconnects. These limitations require further studies
that we intend to conduct in the near future.

To  get  the  previous  results,  we  have  only  considered  MCMs  with  series  terminated 
transmission  lines.  If  more  speed  is  needed,  different  termination  schemes  (parallel  or  active)  can 
be  adopted.  However,  these  solutions  cost  power.  Our  simulations  show  that  a  parallel  termination 
brings potentially 30% more speed for 3 to 5 times more power. In this case optics becomes
significantly  more  power  efficient  (well  over  10  times)  for  similar  or  better  speeds. 

5.3.2  Routing  and  Sorting  applications 

Of particular interest for the EOCA system is the implementation of interconnection networks to
provide for processor to processor communication. The basic interconnect proposed as an integral
part of the 3-D computer is the two-dimensional mesh, which allows each processor to
communicate with any one of its four nearest neighbors. This network is important in image
processing  applications  and  any  other  function  which  exhibits  local  data  dependencies.  To  broaden 
the  scope  of  applications  of  the  3-D  computer  and  expand  its  capabilities  with  more  powerful 
communications  and  high  bandwidth  I/O,  a  two-stage  electro-optical  network  (2S-EON)  has  been 
proposed. 

The  purpose  of  this  analysis  is  to  provide  a  summary  of  the  routing  characteristics  of  the 
2S-EON.  First,  it  will  be  shown  that  the  2S-EON  is  an  Expanded  Delta  Network  (EDN)  and  that 
as  such  it  can  perform  any  permutation  in  exactly  two  passes  using  global  knowledge  (off-line 
routing).  Bounds  on  on-line  routing  are  developed  by  means  of  mapping  a  sorting  algorithm  onto 
the  structure  which  in  turn  can  effectively  emulate  the  mesh. 

5.3.2.1 System Architecture

There  are  two  distinct  sections  to  the  3-D  computer  defined  here:  the  wafer-scale  3-D  processor 
array  and  the  global  two-stage  router/permutation  network  implemented  as  an  electro-optical 
hybrid.  Figure  5-12  shows  a  simplified  version  of  the  system. 

The set of stacked VLSI wafers defines the processing elements, with functions distributed
across the wafers rather than on a single wafer surface. Consistent with this concept, planar
processing connectivity also exists within at least one wafer in the stack. A typical interconnect of
this kind would be the two-dimensional grid encountered in typical mesh-connected parallel
computers. Because the cost of communications can become excessive, algorithms that run on
a two-dimensional array must exhibit good locality and require communications with nearest
neighbors only.

Figure 5-12. A simplified version of a two-stage electro-optical network.

To  accommodate  scattered  communications  patterns,  a  2S-EON  is  provided  in  the  enhanced 
3-D  architecture.  This  network,  as  shown  in  Figure  5-4,  is  defined  below.  Its  routing  capabilities 
will  be  analyzed  by  developing  upper  bounds  relative  to  its  sorting  capabilities. 

Definition 1. The two-stage interconnect consists of a set of $2n$ $n \times n$ ($n$-input, $n$-output)
electronic switches arranged into two columns of $n$ switches each. Let ${}_{l}s_{i}$ denote the $i$-th switch
of column $l$ and let $I_k$ represent each one of the $n^2$ inputs to the network. $I_k$ connects to input $j$ of
switch ${}_{1}s_{i}$ if, and only if, $k = in + j$.

The interconnection between stages is such that output $i$ of switch ${}_{1}s_{j}$ connects to input $j$ of
switch ${}_{2}s_{i}$.

Note that according to the definition above, the first stage groups the columns of an $n \times n = n^2$
array where the grid nodes have been mapped onto an $n^2$ linear array assuming a row major
indexing scheme. In between stages, the columns are transposed such that each switch of the
second stage groups the rows of the same array. The transpose permutation is done by optical
means following the natural geometry of lenslet optics. As a matter of convenience, the input
transpose permutation can be transferred to the output of the network, thus obtaining an equivalent
functionality.
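The wiring of the two-stage interconnect can be made concrete with a small index model. The sketch below assumes that input $I_k$ enters first-stage switch $i$ at port $j$ with $k = in + j$, that the inter-stage wiring is a transpose (output $i$ of first-stage switch $j$ feeds input $j$ of second-stage switch $i$), and that every first-stage switch is set to the identity so the effect of the wiring alone is visible. The function names are illustrative.

```python
def first_stage_port(k, n):
    """Return (switch i, input j) for network input k, where k = i*n + j."""
    return divmod(k, n)


def interstage_link(switch_j, output_i):
    """Transpose wiring: (first-stage switch, output) -> (second-stage switch, input)."""
    return output_i, switch_j


def second_stage_inputs(n):
    """List, per second-stage switch, which network inputs reach it.

    Assumes each first-stage switch passes input j straight to output j.
    """
    groups = [[None] * n for _ in range(n)]
    for k in range(n * n):
        sw, port = first_stage_port(k, n)
        sw2, in2 = interstage_link(sw, port)
        groups[sw2][in2] = k
    return groups


for i, group in enumerate(second_stage_inputs(3)):
    print("second-stage switch", i, "receives inputs", group)
```

With identity switches, each second-stage switch collects one element from every first-stage switch, which is the transpose behavior attributed to the lenslet optics.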


5.3.2.2 Off-line Permutation Routing

The 2S-EON can be seen as a particular case of the Expanded Delta Network shown in
Figure 5-13. Note that the inputs are single wire lines while the interstage interconnect carries many
signals simultaneously. Whether this bandwidth multiplication mechanism is
electrical or optical is a matter of implementation. For the purpose of analysis, they are equivalent.
The ratio of high bandwidth interstage connections to the low bandwidth input is called the
expansion factor.

Figure 5-13. The 2S-EON as a particular case of the Expanded Delta Network.

Referring to Figure 5-14, a Delta Network with expansion factor of 1 is shown and used in
Figure 5-15 to build a two-pass network (two nets in series). By reduction of the middle two
stages to a single less powerful stage, the two cascaded networks are shown to behave as a single
Clos network. Because it is known that the Clos network can perform any permutation in one pass
with global knowledge, the cascaded version of the 2S-EON is capable of the same.

To study on-line routing bounds, one must use other techniques. Because the Shearsort
algorithm maps naturally onto the network, it will be used to derive on-line routing bounds.

Figure 5-14. Schematic of Delta Network constructed using nxn switches.

Figure 5-15. With global control, Delta Network is a two-pass network.

5.3.2.3 The Shearsort Algorithm

Shearsort is a two-dimensional sorting algorithm for nxn arrays like the one shown in
Figure 5-16. The basic operations allowed in this model of computation are given in
Figure 5-17.

Figure 5-16. Shearsort algorithm: two dimensional computation model for nxn array.

Operations  Allowed 

•  Conditional  exchange  between  elements 
in  a  row  or  column 

•  Routing  elements  within  a  row  or  column 

Figure  5-17.  The  basic  allowed  operations  for  the  shearsort  algorithm. 

Given an n^2-element sequence, shearsort is an iterative algorithm as follows:

Definition 2. Shearsort (array nxn);
repeat log n times

sort all columns in parallel in non-decreasing order from top to bottom
sort all even numbered rows in non-decreasing order from left to right
sort all odd numbered rows in non-decreasing order from right to left
end repeat
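
The iteration in Definition 2 can be sketched in software as follows. This is a minimal illustration, not the hardware mapping: the function names `shearsort` and `snake_order` are hypothetical, rows are numbered from 0 (so row 0 counts as even), and each row or column sort is done with Python's built-in sort rather than with compare-exchange steps. As a conservative assumption of this sketch, it runs one iteration more than the log n of Definition 2, because each iteration here begins with the column sort.

```python
import math

def shearsort(grid):
    """Return a copy of an n x n grid sorted into snake-like order."""
    n = len(grid)
    if n == 0:
        return []
    a = [list(row) for row in grid]  # work on a copy of the input
    # Assumption of this sketch: ceil(log2 n) + 1 iterations (one extra).
    iterations = math.ceil(math.log2(n)) + 1
    for _ in range(iterations):
        # Sort every column in non-decreasing order from top to bottom.
        for c in range(n):
            col = sorted(a[r][c] for r in range(n))
            for r in range(n):
                a[r][c] = col[r]
        # Even rows: non-decreasing left to right.
        # Odd rows: non-decreasing right to left (descending left to right).
        for r in range(n):
            a[r].sort(reverse=(r % 2 == 1))
    return a

def snake_order(a):
    """Read the grid row by row, reversing the odd-numbered rows."""
    out = []
    for r, row in enumerate(a):
        out.extend(row if r % 2 == 0 else reversed(row))
    return out
```

Reading the result with `snake_order` gives the elements in non-decreasing order; for example, a 4x4 grid holding 15 down to 0 reads back as 0 through 15.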

Shearsort can run on a two-dimensional mesh-connected array with a total time complexity of
O(n log n) nearest-neighbor compare-exchange operations, since each of the log n iterations sorts
rows and columns of length n in O(n) compare-exchange steps. To prove this we use the zero-one
principle given in Figure 5-18. Note that when sorting the columns of a row-sorted 2xn array
(its two rows sorted in opposite directions), one row will become clean, with all zeros (ones),
while the other will remain dirty (a combination of 0s and 1s, Figure 5-19). When pairs of rows
are thus sorted in the nxn array and the column sort is carried out to completion, the 1-clean rows
fall to the bottom of the array while the 0-clean rows float to the top of the array. Thus, at least
half of the dirty rows are cleaned at every shearsort iteration. This completes the proof outline,
since a sorted array is one with only clean 0-rows on top, clean 1-rows at the bottom, and a single
sorted dirty row separating them.
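
The halving argument can be checked numerically on arrays of 0s and 1s. The sketch below is illustrative only: the helper names are hypothetical, and it records the number of dirty rows immediately after each column sort, which is the point at which the clean rows have settled at the top and bottom. It runs ceil(log2 n) + 1 iterations, an assumption of this sketch chosen so that the trace ends with at most one dirty row.

```python
import math

def sort_columns(a):
    """Sort each column in place, smallest element at the top."""
    n = len(a)
    for c in range(n):
        col = sorted(a[r][c] for r in range(n))
        for r in range(n):
            a[r][c] = col[r]

def sort_rows_snake(a):
    """Sort even rows left to right and odd rows right to left, in place."""
    for r, row in enumerate(a):
        row.sort(reverse=(r % 2 == 1))

def count_dirty_rows(a):
    """A row is dirty if it holds both 0s and 1s."""
    return sum(1 for row in a if 0 < sum(row) < len(row))

def dirty_row_trace(grid):
    """Dirty-row count after the column sort of each iteration (0/1 grid, n >= 1)."""
    a = [list(row) for row in grid]
    trace = []
    for _ in range(math.ceil(math.log2(len(a))) + 1):
        sort_columns(a)
        trace.append(count_dirty_rows(a))
        sort_rows_snake(a)
    return trace
```

Each entry of the returned trace should be no more than half of the previous entry, rounded up, which is the behavior the proof outline relies on.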

A typical embedding for shearsort is shown in Figure 5-20. Note that the structure is
reminiscent of the 2S-EON, so shearsort can be easily mapped into the two-stage electro-optical
network if the switches are capable of sorting n-element sequences. Figure 5-21 gives such an
example, together with a recursion used to determine the number of 2x2 switch stages needed for
the network to sort.

If sequential sorting is assumed, the total sorting time in the network is O(n log^2 n)
compare-exchange operations. However, if each switch is implemented as a bitonic sorting
network, then the total sorting time would become O(log^3 n).

The importance of the timing results above lies in the fact that routing is a subproblem of
sorting, and the sorting complexity of any network is an upper bound for routing. Hence, if the
two-stage electro-optical network is implemented with intelligent switches capable of some
processing, routing can be effected in O(n log^2 n). However, if the network is implemented with
only two-by-two switches arranged such that each switch of the two-stage network is an n-input
bitonic network, then routing would become an O(log^3 n) operation.
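
The two bounds can be made concrete with a short calculation. The sketch below is an estimate under stated assumptions, not the recursion of Figure 5-21: it takes n to be a power of two, counts two sorts (one column sort and one row sort) in each of the log n iterations of Definition 2, uses the standard depth of log n (log n + 1)/2 comparator stages for an n-input bitonic sorter, and assumes about n log n comparisons for each sequential sort. All function names are hypothetical.

```python
def log2_exact(n):
    """Return log2(n) for n a power of two; reject anything else."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    return n.bit_length() - 1

def bitonic_depth(n):
    """Comparator stages in an n-input bitonic sorting network."""
    k = log2_exact(n)
    return k * (k + 1) // 2

def bitonic_stage_count(n):
    """2x2 switch stages traversed: 2 log n sorts, each a bitonic network."""
    return 2 * log2_exact(n) * bitonic_depth(n)

def sequential_compare_estimate(n):
    """Compare-exchange estimate: 2 log n sorts of about n log n steps each."""
    k = log2_exact(n)
    return 2 * k * (n * k)
```

For n = 16 the first count is 8 sorts of depth 10, or 80 stages, which grows as log^3 n, while the sequential estimate grows as n log^2 n.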


ZERO-ONE  PRINCIPLE 

If a network with n input lines sorts all 2^n sequences of 0's
and 1's into non-decreasing order, it will sort any arbitrary
sequence of n numbers into non-decreasing order.


Figure  5-18.  Zero-One  principle. 
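
The zero-one principle makes exhaustive verification of a small comparator network practical, since only 2^n binary inputs need to be tried. The sketch below illustrates this on a 4-input network of five comparators; the network `NET4` and the function names are assumptions introduced for the example and do not appear in the report.

```python
from itertools import product

# Hypothetical 4-input network: each pair (i, j) with i < j is a comparator
# that leaves the smaller value on line i and the larger on line j.
NET4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]

def apply_network(network, values):
    """Pass a sequence through the comparators in order; return a new list."""
    v = list(values)
    for i, j in network:
        if v[i] > v[j]:
            v[i], v[j] = v[j], v[i]
    return v

def sorts_all_zero_one_inputs(network, n):
    """Check the network on all 2^n sequences of 0s and 1s."""
    return all(
        apply_network(network, bits) == sorted(bits)
        for bits in product((0, 1), repeat=n)
    )
```

If `sorts_all_zero_one_inputs(NET4, 4)` holds, the principle guarantees that the same network sorts any four numbers, so the 16 binary cases stand in for all possible inputs.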
Figure 5-19. Proof using the [0,1] principle: The first row is sorted from left to right
and the second from right to left. The actual number of 0's and 1's is irrelevant. After the column
sort, one of the two rows will contain only 0's or only 1's, depending on the actual number of 0's and 1's.
In the most favorable case, when both are equal, we will have both rows 'clean'.
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Figure  5-21.  Mapping  the  shearsort  into  the  two-stage  electro-optical  network. 
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